MarcoMeter / episodic-transformer-memory-ppo

Clean baseline implementation of PPO using an episodic TransformerXL memory


Atari env

hlsafin opened this issue · comments

Hello, I attempted to set up an Atari environment and train using your current code, but unfortunately, I was unable to learn anything during training. Could you suggest any possible explanations for this and recommend specific hyperparameters that I could experiment with for Pong?

Hi @hlsafin
Which hyperparameters did you initially try? Could you check the tensorboard summary for vanishing gradients?

Here is my yaml file. I couldn't go too crazy with the architecture because I'm using an RTX 3080 with ~16 GB of VRAM.
I didn't record tensorboard because I was just doing initial tests where I printed the rewards when an episode ends, and essentially all the rewards were around 0-2 after 8 hours of training with this setup.
environment:
  type: "Atari"
  name: BreakoutNoFrameskip-v4
gamma: 0.995
lamda: 0.95
updates: 200000
epochs: 4
n_workers: 2
worker_steps: 128
n_mini_batch: 8
value_loss_coefficient: 0.5
hidden_layer_size: 128
max_grad_norm: 0.5
transformer:
  num_blocks: 3
  embed_dim: 128
  num_heads: 4
  memory_length: 64
  positional_encoding: "relative" # options: "" "relative" "learned"
  layer_norm: "post" # options: "" "pre" "post"
  gtrxl: False
  gtrxl_bias: 0.0
learning_rate_schedule:
  initial: 3.5e-4
  final: 1.0e-4
  power: 1.0
  max_decay_steps: 250
beta_schedule:
  initial: 0.001
  final: 0.001
  power: 1.0
  max_decay_steps: 10000
clip_range_schedule:
  initial: 0.1
  final: 0.1
  power: 1.0
  max_decay_steps: 10000

My shallow take on this is to shrink the number of transformer blocks down to 1, while increasing the embedding and hidden layer size to 256 or 384. If possible, increase the overall batch size by utilizing more workers.

It is useful to monitor the gradients. If vanishing gradients occur, you should change layer norm from "post" to "pre".
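
For monitoring, a small helper along these lines could log per-parameter gradient norms to tensorboard (just a sketch; model, writer, and update_step are placeholders for whatever the training loop already uses):

import torch

def log_grad_norms(model: torch.nn.Module, writer, update_step: int):
    """Log the L2 norm of each parameter's gradient; call right after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            writer.add_scalar("gradients/" + name, param.grad.data.norm(2).item(), update_step)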

You should also be able to reduce the memory length to something like 16. Pong is rather a short-term memory problem. Theoretically, a memory length of 4 should suffice, as this is the number of frames that are usually stacked to solve this environment.

Check your learning rate schedule. I'd suggest going with a constant learning rate of 2.0e-4 for now. The beta schedule (entropy bonus coefficient) could be inspired by related work that trains Pong with frame stacking and PPO.
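
Putting these suggestions together, the relevant parts of the config could look roughly like this (a sketch with illustrative values, not tuned ones; everything not shown stays as in your file):

n_workers: 8              # more workers to enlarge the overall batch size, if feasible
hidden_layer_size: 256
transformer:
  num_blocks: 1
  embed_dim: 256
  num_heads: 4
  memory_length: 16
  positional_encoding: "relative"
  layer_norm: "pre"       # switch from "post" if vanishing gradients show up
  gtrxl: False
  gtrxl_bias: 0.0
learning_rate_schedule:
  initial: 2.0e-4
  final: 2.0e-4           # constant learning rate for now
  power: 1.0
  max_decay_steps: 10000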

There is another possibility to save memory: you could collect all training data on the CPU, while mini-batches are pushed to the GPU one at a time for optimization. This should allow for larger batches, which is quite helpful.
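
As a rough illustration of that idea (a sketch; the buffer layout and the optimize_step callable are placeholders for whatever the training loop already has):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train_on_minibatches(buffer, batch_size, optimize_step):
    """Keep the full rollout on the CPU and move one mini-batch at a time to the GPU."""
    n_samples = next(iter(buffer.values())).shape[0]
    indices = torch.randperm(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = indices[start:start + batch_size]
        # only this mini-batch is transferred to the GPU
        mini_batch = {key: tensor[idx].to(device) for key, tensor in buffer.items()}
        optimize_step(mini_batch)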

Hope this helps.

Yeah, I'm still not getting any good results. I might be doing something wrong here, who knows.

Could you provide a tensorboard summary?

And this is Breakout:
[screenshot]

I mean the entire summary file as I'm interested in all monitored stats. Your current training config would be helpful as well.

[screenshots of tensorboard plots]

Just attach the summary file to this issue via drag and drop.
Please provide your current config.

Looking at the screenshots, the norm of the value function's gradients notably sticks out:
the value function suffers from vanishing gradients.
Also, the monitored value mean stays around zero.
There must be something wrong with your Atari environment wrapper.
Could it be that the step function does not properly return the reward?
The tensorboard summary shows the return that is part of the info dictionary, so there could be an issue with your step() function.

Could you provide your environment code?

import gym
import numpy as np
import time
import collections

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=1):
        """Return only every skip-th frame"""
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = collections.deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info

    def reset(self):
        """Clear past frame buffer and init. to first obs. from inner env."""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs

class Atari:
    def __init__(self, env_name):
        self._env = gym.make(env_name)
        # self.max_episode_steps = self._env.spec.max_episode_steps
        self.max_episode_steps = 108000
        # self._env = MaxAndSkipEnv(self._env, skip=4)
        self._env = gym.wrappers.ResizeObservation(self._env, (int(84/1), int(84/1)))
        self._env = gym.wrappers.GrayScaleObservation(self._env)
        self._env = gym.wrappers.FrameStack(self._env, 1)

    @property
    def observation_space(self):
        return self._env.observation_space

    @property
    def action_space(self):
        return self._env.action_space

    def reset(self):
        self._rewards = []
        obs = self._env.reset()
        obs = np.stack(obs[0]._frames)
        return obs

    def step(self, action):
        # obs, reward, done, info = self._env.step(action[0])
        obs, reward, done, _, info = self._env.step(action[0])
        obs = np.stack(obs._frames)
        self._rewards.append(reward)
        if done:
            info = {"reward": sum(self._rewards),
                    "length": len(self._rewards)}
            print(sum(self._rewards))
        else:
            info = None
        return obs, reward / 100.0, done, info

    def render(self):
        self._env.render()
        time.sleep(0.033)

    def close(self):
        self._env.close()

Did you test your wrapper and print reward / 100.0, obs, and done?

self.max_episode_steps = 108000
This is way too large and will consume too much memory. Lower this to like 512.

Noted, I can bring that down to 512 and test it again, though I'm not sure if that was the main cause. I can also do further tests of the reward / 100.0.

Max episode steps won't be the cause of the bad training results; a value that is too large only makes the runtime less efficient.

Most importantly, verify your environment. Based on the tensorboard summary, it really looks like the agent always receives a reward of 0.

Certainly, this might be the case. I will change it to "return obs, reward, done, info" instead and see if this makes a difference. I will do another run later tonight and see what happens. I'll keep you posted.
Thank you

You can write a script that runs your environment wrapper using random actions. This way, you can more easily debug and verify your environment.
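
For example, a minimal test could look like this (a sketch, assuming your wrapper is importable, e.g. from a hypothetical atari_env.py):

from atari_env import Atari  # hypothetical module containing the wrapper above

env = Atari("PongNoFrameskip-v4")
obs = env.reset()
print("obs shape:", obs.shape)

done = False
rewards = []
while not done:
    # the wrapper's step() indexes action[0], so pass the sampled action in a list
    obs, reward, done, info = env.step([env.action_space.sample()])
    rewards.append(reward)

print("episode length:", len(rewards), "scaled return:", sum(rewards), "info:", info)
env.close()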

Now I am getting this odd error if self.max_episode_steps = 512, so I changed it to self.max_episode_steps = 10000:

/opt/conda/conda-bld/pytorch_1678402374358/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
(the same assertion is repeated for threads [1,0,0] through [31,0,0])

What is your memory_length? It cannot be greater than max_episode_steps.
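
If it helps, a small sanity check along these lines could be added where the config and environment are created (a sketch; config is assumed to be the parsed yaml dict and env the wrapper instance):

def check_memory_length(config, env):
    """Hypothetical sanity check: the episodic memory cannot be longer than an episode."""
    memory_length = config["transformer"]["memory_length"]
    if memory_length > env.max_episode_steps:
        raise ValueError(
            "memory_length ({}) must not exceed max_episode_steps ({})".format(
                memory_length, env.max_episode_steps))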

memory_length is 16. Okay, I did another training run last night; here are the results. I zipped and uploaded the summary here:
summaries.tar.gz

Here is the yaml file:
environment:
  type: "Atari"
  name: PongNoFrameskip-v4
gamma: 0.995
lamda: 0.95
updates: 200000
epochs: 4
n_workers: 4
worker_steps: 128
n_mini_batch: 8
value_loss_coefficient: 0.5
hidden_layer_size: 384
max_grad_norm: 0.5
transformer:
  num_blocks: 1
  embed_dim: 128
  num_heads: 4
  memory_length: 16
  positional_encoding: "relative" # options: "" "relative" "learned"
  layer_norm: "pre" # options: "" "pre" "post"
  gtrxl: False
  gtrxl_bias: 0.0
learning_rate_schedule:
  initial: 3.5e-4
  final: 1.0e-4
  power: 1.0
  max_decay_steps: 250
beta_schedule:
  initial: 0.001
  final: 0.001
  power: 1.0
  max_decay_steps: 10000
clip_range_schedule:
  initial: 0.1
  final: 0.1
  power: 1.0
  max_decay_steps: 10000

Okay, sorry! I fixed the issue; it was definitely in the setup of the environment.

Glad you found it!