takuseno / d3rlpy

An offline deep reinforcement learning library

Home Page: https://takuseno.github.io/d3rlpy

[QUESTION] Plotting average reward

HateBunnyPlzzz opened this issue · comments

I want to plot the average return/total reward for an algorithm trained for about 100K timesteps.

To Reproduce

from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy
import gym
import torch
import d3rlpy

# random state
random_state = 12345
device = "cuda:0" if torch.cuda.is_available() else "cpu"
env = gym.make("MountainCarContinuous-v0")

# data collection while training DDPG online
ddpg = d3rlpy.algos.DDPGConfig().create(device=device)
buffer = create_fifo_replay_buffer(limit=100000, env=env)
explorer = ConstantEpsilonGreedy(0.3)
ddpg.fit_online(env, buffer, explorer, n_steps=100000)

# saving the buffer dataset
with open("DDPG_Mountain-Car_continuous_replay_buffer.h5", "w+b") as f:
    buffer.dump(f)
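(If I'm reading the v2 dataset docs right, the dumped buffer can later be reloaded for offline training roughly like this; ReplayBuffer.load and InfiniteBuffer are my reading of the docs, so double-check the exact API:)

# reload the saved transitions into an unbounded buffer for offline use
with open("DDPG_Mountain-Car_continuous_replay_buffer.h5", "rb") as f:
    buffer = d3rlpy.dataset.ReplayBuffer.load(f, d3rlpy.dataset.InfiniteBuffer())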

In the log files, I had the critic and actor loss logged to CSV files, which I did for three different experiments:
DDPG, TD3, and SAC on MountainCarContinuous.
I'll share the plots for both the actor and critic loss. But how can I plot the same curve for the reward, which is obviously the ideal metric for comparing performance?

[plots of actor and critic loss for the DDPG, TD3, and SAC runs]

This is not looking good: all three algorithms show abnormal convergence behavior, even though the code is very basic and I changed nothing in terms of hyperparameters.

Thank you!

@HateBunnyPlzzz Thanks for the issue. I need to mention a couple of things:

To do evaluation during training, you need to pass eval_env to fit_online method:

env = gym.make("MountainCarContinuous-v0")
eval_env = gym.make("MountainCarContinuous-v0")
...
ddpg.fit_online(env, buffer, explorer, n_steps=100000, eval_env=eval_env)

ConstantEpsilonGreedy is mainly for discrete action-space tasks. I recommend NormalNoise instead:
https://d3rlpy.readthedocs.io/en/v2.3.0/references/generated/d3rlpy.algos.NormalNoise.html#d3rlpy.algos.NormalNoise
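For example, a minimal sketch (the mean/std arguments here are illustrative; check the linked docs for the exact signature):

# Gaussian exploration noise for continuous actions
explorer = d3rlpy.algos.NormalNoise(mean=0.0, std=0.1)
ddpg.fit_online(env, buffer, explorer, n_steps=100000, eval_env=eval_env)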

Thank you so much for your response!
I have one more thing to ask: to plot the reward, do I need to plot the rollout-return CSV log file, or does the eval_env log anything different?
Thank you again, I really appreciate the effort.

The rollout return is the return collected during the training episodes, so please use the eval return for performance comparison.
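For the plot itself, something along these lines should work (the run directory names and the epoch/step/value CSV layout are assumptions on my side; adjust them to whatever your d3rlpy_logs folder actually contains):

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical run directories under d3rlpy_logs/
runs = {
    "DDPG": "d3rlpy_logs/DDPG_online_1/evaluation.csv",
    "TD3": "d3rlpy_logs/TD3_online_1/evaluation.csv",
    "SAC": "d3rlpy_logs/SAC_online_1/evaluation.csv",
}

for name, path in runs.items():
    # each metric CSV is assumed to hold epoch, step, value rows without a header
    df = pd.read_csv(path, names=["epoch", "step", "value"])
    plt.plot(df["step"], df["value"], label=name)

plt.xlabel("environment steps")
plt.ylabel("evaluation return")
plt.legend()
plt.show()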

Let me close this issue since the issue seems resolved. Feel free to reopen this if there is any further question.