Atari env
hlsafin opened this issue
Hello, I attempted to set up an Atari environment and train using your current code, but unfortunately, I was unable to learn anything during training. Could you suggest any possible explanations for this and recommend specific hyperparameters that I could experiment with for Pong?
Hi @hlsafin
which hyperparameters did you initially try? Could you check the tensorboard summary for vanishing gradients?
Here is my yaml file. I couldn't go too crazy with the architecture because I'm using an RTX 3080 with ~16 GB of VRAM.
I didn't record TensorBoard summaries because I was just running initial tests in which I printed the reward at the end of each game. Essentially all rewards were around 0-2 after 8 hours of training with this setup.
environment:
    type: "Atari"
    name: BreakoutNoFrameskip-v4
gamma: 0.995
lamda: 0.95
updates: 200000
epochs: 4
n_workers: 2
worker_steps: 128
n_mini_batch: 8
value_loss_coefficient: 0.5
hidden_layer_size: 128
max_grad_norm: 0.5
transformer:
    num_blocks: 3
    embed_dim: 128
    num_heads: 4
    memory_length: 64
    positional_encoding: "relative" # options: "" "relative" "learned"
    layer_norm: "post" # options: "" "pre" "post"
    gtrxl: False
    gtrxl_bias: 0.0
learning_rate_schedule:
    initial: 3.5e-4
    final: 1.0e-4
    power: 1.0
    max_decay_steps: 250
beta_schedule:
    initial: 0.001
    final: 0.001
    power: 1.0
    max_decay_steps: 10000
clip_range_schedule:
    initial: 0.1
    final: 0.1
    power: 1.0
    max_decay_steps: 10000
My shallow take on this is to shrink the number of transformer blocks down to 1, while increasing the embedding and hidden layer size to 256 or 384. If possible, increase the overall batch size by utilizing more workers.
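To get a feel for how the worker count drives the batch size, the arithmetic can be sketched as follows. This assumes (it is not confirmed in this thread) that one update collects n_workers × worker_steps samples and splits them evenly into n_mini_batch mini-batches, which is a common PPO layout. `batch_sizes` is a hypothetical helper, not part of the repository.

```python
from typing import Tuple


def batch_sizes(n_workers: int, worker_steps: int, n_mini_batch: int) -> Tuple[int, int]:
    """Return (samples per update, samples per mini-batch).

    Assumption: each worker contributes `worker_steps` samples per update and
    the batch is split evenly into `n_mini_batch` mini-batches.
    """
    batch = n_workers * worker_steps
    return batch, batch // n_mini_batch


# Values from the config posted above: 2 workers, 128 steps, 8 mini-batches.
print(batch_sizes(2, 128, 8))  # (256, 32)
# Doubling the number of workers doubles both numbers.
print(batch_sizes(4, 128, 8))  # (512, 64)
```

Under these assumptions, the posted config optimizes on mini-batches of only 32 samples, which is why adding workers is one of the cheaper things to try.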
It is useful to monitor the gradients. If vanishing gradients occur, you should change layer norm from "post" to "pre".
You should also be able to reduce the memory length to something like 16. Pong is rather a short-term memory problem. Theoretically, a memory length of 4 should suffice, as this is the number of frames that are usually stacked to solve this environment.
Check your learning rate schedule. I'd suggest going with a constant learning rate of 2.0e-4 for now. The beta schedule (entropy bonus) could be inspired by related work that trains Pong with frame stacking and PPO.
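To see what the posted learning rate schedule does over a run, here is a minimal sketch. It assumes the `initial`, `final`, `power`, and `max_decay_steps` keys describe a standard polynomial decay that is held at `final` afterwards; the function below is an illustration written for this thread, not necessarily the repository's implementation.

```python
def polynomial_decay(initial: float, final: float, max_decay_steps: int,
                     power: float, current_step: int) -> float:
    """Decay from `initial` to `final` over `max_decay_steps`, then stay at `final`."""
    if current_step >= max_decay_steps or initial == final:
        return final
    fraction_left = 1.0 - current_step / max_decay_steps
    return (initial - final) * fraction_left ** power + final


# Learning rate schedule from the posted config.
schedule = dict(initial=3.5e-4, final=1.0e-4, max_decay_steps=250, power=1.0)
for update in (0, 125, 250, 200_000):
    print(update, polynomial_decay(current_step=update, **schedule))
```

Under this assumption, the learning rate reaches its final value of 1.0e-4 after only 250 of the 200000 updates, so almost the entire run trains at that final rate. Setting `initial` and `final` to the same value gives the constant schedule suggested above.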
There is another way to save memory: you could collect all training data on the CPU and push the mini-batches to the GPU one at a time for optimization. This should allow for larger batches, which is quite helpful.
Hope this helps.
Yeah, I'm still not getting any good results. I might be doing something wrong here, who knows.
Could you provide a tensorboard summary?
I mean the entire summary file as I'm interested in all monitored stats. Your current training config would be helpful as well.
Just attach the summary file to this issue via drag and drop.
Please provide your current config.
Looking at the screenshots, the norm of the value function's gradients notably stands out.
The value function suffers from the vanishing gradient issue.
Also, the monitored value mean stays around zero.
There must be something wrong with your Atari environment wrapper.
Could it be that the step function does not properly return the reward?
The tensorboard summary shows the return that is part of the info dictionary. So there could be an issue with your step() function.
Could you provide your environment code?
import gym
import numpy as np
import time
import collections


class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=1):
        """Return only every `skip`-th frame"""
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = collections.deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info

    def reset(self):
        """Clear past frame buffer and init. to first obs. from inner env."""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return


class Atari:
    def __init__(self, env_name):
        self._env = gym.make(env_name)
        #self.max_episode_steps = self._env.spec.max_episode_steps
        self.max_episode_steps = 108000
        #self._env = MaxAndSkipEnv(self._env, skip=4)
        self._env = gym.wrappers.ResizeObservation(self._env, (int(84/1), int(84/1)))
        self._env = gym.wrappers.GrayScaleObservation(self._env)
        self._env = gym.wrappers.FrameStack(self._env, 1)

    @property
    def observation_space(self):
        return self._env.observation_space

    @property
    def action_space(self):
        return self._env.action_space

    def reset(self):
        self._rewards = []
        obs = self._env.reset()
        obs = np.stack(obs[0]._frames)
        return obs

    def step(self, action):
        #obs, reward, done, info = self._env.step(action[0])
        obs, reward, done, _, info = self._env.step(action[0])
        obs = np.stack(obs._frames)
        self._rewards.append(reward)
        if done:
            info = {"reward": sum(self._rewards),
                    "length": len(self._rewards)}
            print(sum(self._rewards))
        else:
            info = None
        return obs, reward / 100.0, done, info

    def render(self):
        self._env.render()
        time.sleep(0.033)

    def close(self):
        self._env.close()
Did you test your wrapper and print reward / 100.0, obs, and done?
self.max_episode_steps = 108000
This is way too large and will consume too much memory. Lower this to like 512.
Noted. I can bring that down to 512 and test it again, though I'm not sure that was the main cause. I can also run further tests on reward / 100.
Max episode steps won't be the cause of the bad training results; a large value only makes the runtime less efficient.
Most importantly, verify your environment. Based on the tensorboard summary, it really looks like the agent always receives 0 as reward.
Certainly, this might be the case. I will change it to "return obs, reward, done, info" instead and see if this makes a difference. I will do another run later tonight and see what happens. I'll keep you posted.
Thank you
You can write a script that runs your environment wrapper using random actions. This way, you can more easily debug and verify your environment.
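A minimal sketch of such a script is shown below. It assumes the interface of the wrapper posted above, that is, `reset()` returns an observation and `step([action])` returns `(obs, reward, done, info)`. Both `check_env` and the stand-in `CountdownEnv` are hypothetical names used only for illustration, so that the example runs without Atari installed.

```python
import random


def check_env(env, num_actions: int, steps: int = 1000, seed: int = 0) -> dict:
    """Step `env` with random actions and summarize what it returns.

    Assumes reset() -> obs and step([action]) -> (obs, reward, done, info).
    """
    rng = random.Random(seed)
    stats = {"steps": 0, "episodes": 0, "nonzero_rewards": 0, "reward_sum": 0.0}
    env.reset()
    for _ in range(steps):
        action = rng.randrange(num_actions)
        obs, reward, done, info = env.step([action])
        stats["steps"] += 1
        stats["reward_sum"] += reward
        stats["nonzero_rewards"] += int(reward != 0)
        if done:
            stats["episodes"] += 1
            env.reset()
    return stats


class CountdownEnv:
    """Hypothetical stand-in: reward 1.0 on every 10th step, episode ends after 50 steps."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        reward = 1.0 if self.t % 10 == 0 else 0.0
        return self.t, reward, self.t >= 50, None


print(check_env(CountdownEnv(), num_actions=2, steps=200))
```

With the real wrapper, the call would look something like `env = Atari("PongNoFrameskip-v4")` followed by `check_env(env, env.action_space.n)`. If `nonzero_rewards` stays at 0 or `episodes` never increases over many steps, the reward or the done flag is probably not making it through `step()`.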
Now I am getting this odd error if self.max_episode_steps = 512, so I changed it to self.max_episode_steps = 10000:
/opt/conda/conda-bld/pytorch_1678402374358/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
(The same assertion is repeated for threads [1,0,0] through [31,0,0].)
What is your memory_length? It cannot be greater than max_episode_steps.
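A constraint like this is cheap to check before training starts. The sketch below uses a hypothetical helper, `validate_memory_length`, which is not part of the repository. It only illustrates failing early with a readable message and makes no claim about what caused the assertion above.

```python
def validate_memory_length(memory_length: int, max_episode_steps: int) -> None:
    """Raise early if the transformer memory window exceeds the episode limit."""
    if memory_length > max_episode_steps:
        raise ValueError(
            f"memory_length ({memory_length}) must not exceed "
            f"max_episode_steps ({max_episode_steps})"
        )


# Passes silently: the window fits inside the episode limit.
validate_memory_length(memory_length=16, max_episode_steps=512)

# Fails with a readable message instead of an error deep inside training.
try:
    validate_memory_length(memory_length=64, max_episode_steps=32)
except ValueError as err:
    print(err)
```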
memory_length is 16. Okay, I did another training run last night; here are the results. I zipped and uploaded the summary here.
summaries.tar.gz
Here is the yaml file:
environment:
    type: "Atari"
    name: PongNoFrameskip-v4
gamma: 0.995
lamda: 0.95
updates: 200000
epochs: 4
n_workers: 4
worker_steps: 128
n_mini_batch: 8
value_loss_coefficient: 0.5
hidden_layer_size: 384
max_grad_norm: 0.5
transformer:
    num_blocks: 1
    embed_dim: 128
    num_heads: 4
    memory_length: 16
    positional_encoding: "relative" # options: "" "relative" "learned"
    layer_norm: "pre" # options: "" "pre" "post"
    gtrxl: False
    gtrxl_bias: 0.0
learning_rate_schedule:
    initial: 3.5e-4
    final: 1.0e-4
    power: 1.0
    max_decay_steps: 250
beta_schedule:
    initial: 0.001
    final: 0.001
    power: 1.0
    max_decay_steps: 10000
clip_range_schedule:
    initial: 0.1
    final: 0.1
    power: 1.0
    max_decay_steps: 10000
Okay, sorry! I fixed the issue; it was definitely in the setup of the environment.
Glad you found it!