MarcoMeter / episodic-transformer-memory-ppo

Clean baseline implementation of PPO using an episodic TransformerXL memory


Atari env

hlsafin opened this issue · comments

Hello, I attempted to set up an Atari environment and train using your current code, but unfortunately, I was unable to learn anything during training. Could you suggest any possible explanations for this and recommend specific hyperparameters that I could experiment with for Pong?

Hi @hlsafin
Which hyperparameters did you initially try? Could you check the tensorboard summary for vanishing gradients?

Here is my yaml file. I couldn't go too crazy with the architecture because I'm using an RTX 3080 with ~16 GB of VRAM.
I didn't record tensorboard because I was just doing initial tests where I printed the rewards when an episode ends, and essentially all the rewards were around 0-2 after 8 hours of training with this setup.
environment:
  type: "Atari"
  name: BreakoutNoFrameskip-v4
gamma: 0.995
lamda: 0.95
updates: 200000
epochs: 4
n_workers: 2
worker_steps: 128
n_mini_batch: 8
value_loss_coefficient: 0.5
hidden_layer_size: 128
max_grad_norm: 0.5
transformer:
  num_blocks: 3
  embed_dim: 128
  num_heads: 4
  memory_length: 64
  positional_encoding: "relative" # options: "" "relative" "learned"
  layer_norm: "post" # options: "" "pre" "post"
  gtrxl: False
  gtrxl_bias: 0.0
learning_rate_schedule:
  initial: 3.5e-4
  final: 1.0e-4
  power: 1.0
  max_decay_steps: 250
beta_schedule:
  initial: 0.001
  final: 0.001
  power: 1.0
  max_decay_steps: 10000
clip_range_schedule:
  initial: 0.1
  final: 0.1
  power: 1.0
  max_decay_steps: 10000

My shallow take on this is to shrink the number of transformer blocks down to 1, while increasing the embedding and hidden layer size to 256 or 384. If possible, increase the overall batch size by utilizing more workers.

It is useful to monitor the gradients. If vanishing gradients occur, you should change layer norm from "post" to "pre".
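
For monitoring, a small helper along these lines could log per-parameter gradient norms to tensorboard (just a sketch; model, writer, and update_step are placeholders for whatever the training loop already uses):

import torch

def log_grad_norms(model: torch.nn.Module, writer, update_step: int):
    """Log the L2 norm of each parameter's gradient; call right after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            writer.add_scalar("gradients/" + name, param.grad.data.norm(2).item(), update_step)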

You should also be able to reduce the memory length to something like 16. Pong is rather a short-term memory problem. Theoretically, a memory length of 4 should suffice, as this is the number of frames that are usually stacked to solve this environment.

Check your learning rate schedule. I'd suggest going with a constant learning rate of 2.0e-4 for now. The beta schedule (entropy bonus coefficient) could be inspired by related work that trains Pong with frame stacking and PPO.
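
Putting these suggestions together, the relevant parts of the config could look roughly like this (a sketch with illustrative values, not tuned ones; everything not shown stays as in your file):

n_workers: 8              # more workers to enlarge the overall batch size, if feasible
hidden_layer_size: 256
transformer:
  num_blocks: 1
  embed_dim: 256
  num_heads: 4
  memory_length: 16
  positional_encoding: "relative"
  layer_norm: "pre"       # switch from "post" if vanishing gradients show up
  gtrxl: False
  gtrxl_bias: 0.0
learning_rate_schedule:
  initial: 2.0e-4
  final: 2.0e-4           # constant learning rate for now
  power: 1.0
  max_decay_steps: 10000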

There is another possibility to save memory: you could collect all training data on the CPU, while mini-batches are pushed to the GPU one at a time for optimization. This should allow for larger batches, which is quite helpful.
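
As a rough illustration of that idea (a sketch; the buffer layout and the optimize_step callable are placeholders for whatever the training loop already has):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train_on_minibatches(buffer, batch_size, optimize_step):
    """Keep the full rollout on the CPU and move one mini-batch at a time to the GPU."""
    n_samples = next(iter(buffer.values())).shape[0]
    indices = torch.randperm(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = indices[start:start + batch_size]
        # only this mini-batch is transferred to the GPU
        mini_batch = {key: tensor[idx].to(device) for key, tensor in buffer.items()}
        optimize_step(mini_batch)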

Hope this helps.

Yeah, I'm still not getting any good results. I might be doing something wrong here, who knows.

Could you provide a tensorboard summary?

And this is Breakout:
[screenshot]

I mean the entire summary file as I'm interested in all monitored stats. Your current training config would be helpful as well.

[screenshots of tensorboard plots]

Just attach the summary file to this issue via drag and drop.
Please provide your current config.

Looking at the screenshots, the norm of the value function's gradients notably sticks out:
the value function suffers from vanishing gradients.
Also, the monitored value mean stays around zero.
There must be something wrong with your Atari environment wrapper.
Could it be that the step function does not properly return the reward?
The tensorboard summary shows the return that is part of the info dictionary, so there could be an issue with your step() function.

Could you provide your environment code?

import gym
import numpy as np
import time
import collections

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=1):
        """Return only every skip-th frame"""
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = collections.deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info

    def reset(self):
        """Clear past frame buffer and init. to first obs. from inner env."""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs

class Atari:
    def __init__(self, env_name):
        self._env = gym.make(env_name)
        # self.max_episode_steps = self._env.spec.max_episode_steps
        self.max_episode_steps = 108000
        # self._env = MaxAndSkipEnv(self._env, skip=4)
        self._env = gym.wrappers.ResizeObservation(self._env, (int(84/1), int(84/1)))
        self._env = gym.wrappers.GrayScaleObservation(self._env)
        self._env = gym.wrappers.FrameStack(self._env, 1)

    @property
    def observation_space(self):
        return self._env.observation_space

    @property
    def action_space(self):
        return self._env.action_space

    def reset(self):
        self._rewards = []
        obs = self._env.reset()
        obs = np.stack(obs[0]._frames)
        return obs

    def step(self, action):
        # obs, reward, done, info = self._env.step(action[0])
        obs, reward, done, _, info = self._env.step(action[0])
        obs = np.stack(obs._frames)
        self._rewards.append(reward)
        if done:
            info = {"reward": sum(self._rewards),
                    "length": len(self._rewards)}
            print(sum(self._rewards))
        else:
            info = None
        return obs, reward / 100.0, done, info

    def render(self):
        self._env.render()
        time.sleep(0.033)

    def close(self):
        self._env.close()

Did you test your wrapper and print reward / 100.0, obs, and done?

self.max_episode_steps = 108000
This is way too large and will consume too much memory. Lower this to like 512.

Noted, I can bring that down to 512 and test it again, though I'm not sure if that was the main cause. I can also do further tests of the reward / 100.0.

Max episode steps won't be the cause of the bad training results; a value that is too large only makes the runtime less efficient.

Most importantly, verify your environment. Based on the tensorboard summary, it really looks like the agent always receives a reward of 0.

Certainly, this might be the case. I will change it to "return obs, reward, done, info" instead and see if this makes a difference. I will do another run later tonight and see what happens. I'll keep you posted.
Thank you

You can write a script that runs your environment wrapper using random actions. This way, you can more easily debug and verify your environment.
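
For example, a minimal test could look like this (a sketch, assuming your wrapper is importable, e.g. from a hypothetical atari_env.py):

from atari_env import Atari  # hypothetical module containing the wrapper above

env = Atari("PongNoFrameskip-v4")
obs = env.reset()
print("obs shape:", obs.shape)

done = False
rewards = []
while not done:
    # the wrapper's step() indexes action[0], so pass the sampled action in a list
    obs, reward, done, info = env.step([env.action_space.sample()])
    rewards.append(reward)

print("episode length:", len(rewards), "scaled return:", sum(rewards), "info:", info)
env.close()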

Now I am getting this odd error if self.max_episode_steps = 512, so I changed it to self.max_episode_steps = 10000:

/opt/conda/conda-bld/pytorch_1678402374358/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
(the same assertion is repeated for threads [1,0,0] through [31,0,0])

What is your memory_length? It cannot be greater than max_episode_steps.
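
If it helps, a small sanity check along these lines could be added where the config and environment are created (a sketch; config is assumed to be the parsed yaml dict and env the wrapper instance):

def check_memory_length(config, env):
    """Hypothetical sanity check: the episodic memory cannot be longer than an episode."""
    memory_length = config["transformer"]["memory_length"]
    if memory_length > env.max_episode_steps:
        raise ValueError(
            "memory_length ({}) must not exceed max_episode_steps ({})".format(
                memory_length, env.max_episode_steps))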

memory_length is 16. Okay, I did another training run last night; here are the results. I zipped and uploaded the summary here:
summaries.tar.gz

Here is the yaml file:
environment:
  type: "Atari"
  name: PongNoFrameskip-v4
gamma: 0.995
lamda: 0.95
updates: 200000
epochs: 4
n_workers: 4
worker_steps: 128
n_mini_batch: 8
value_loss_coefficient: 0.5
hidden_layer_size: 384
max_grad_norm: 0.5
transformer:
  num_blocks: 1
  embed_dim: 128
  num_heads: 4
  memory_length: 16
  positional_encoding: "relative" # options: "" "relative" "learned"
  layer_norm: "pre" # options: "" "pre" "post"
  gtrxl: False
  gtrxl_bias: 0.0
learning_rate_schedule:
  initial: 3.5e-4
  final: 1.0e-4
  power: 1.0
  max_decay_steps: 250
beta_schedule:
  initial: 0.001
  final: 0.001
  power: 1.0
  max_decay_steps: 10000
clip_range_schedule:
  initial: 0.1
  final: 0.1
  power: 1.0
  max_decay_steps: 10000

Okay, sorry! I fixed the issue; it was definitely in the setup of the environment.

Glad you found it!