takuseno / d3rlpy

An offline deep reinforcement learning library

Home Page: https://takuseno.github.io/d3rlpy

[QUESTION] Plotting average reward

HateBunnyPlzzz opened this issue · comments

I want to plot the average return/total reward for an algorithm trained for about 100K timesteps.

To Reproduce

from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy
import gym
import torch
import d3rlpy

# random state
random_state = 12345
device = "cuda:0" if torch.cuda.is_available() else "cpu"
env = gym.make("MountainCarContinuous-v0")

# data collection while training DDPG online
ddpg = d3rlpy.algos.DDPGConfig().create(device=device)
buffer = create_fifo_replay_buffer(limit=100000, env=env)
explorer = ConstantEpsilonGreedy(0.3)
ddpg.fit_online(env, buffer, explorer, n_steps=100000)

# saving the buffer dataset
with open("DDPG_Mountain-Car_continuous_replay_buffer.h5", "w+b") as f:
    buffer.dump(f)
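(If I'm reading the v2 dataset docs right, the dumped buffer can later be reloaded for offline training roughly like this; ReplayBuffer.load and InfiniteBuffer are my reading of the docs, so double-check the exact API:)

# reload the saved transitions into an unbounded buffer for offline use
with open("DDPG_Mountain-Car_continuous_replay_buffer.h5", "rb") as f:
    buffer = d3rlpy.dataset.ReplayBuffer.load(f, d3rlpy.dataset.InfiniteBuffer())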

In the log files, I had the critic and actor loss logged to CSV files, which I did for three different experiments:
DDPG, TD3, and SAC on MountainCarContinuous.
I'll share the plots for both the actor and critic loss. But how can I plot the same curve for the reward, which is obviously the ideal metric for comparing performance?

[plots of actor and critic loss for the DDPG, TD3, and SAC runs]

This is not looking good: all three algorithms show abnormal convergence behavior, even though the code is very basic and I changed nothing in terms of hyperparameters.

Thank you!

@HateBunnyPlzzz Thanks for the issue. I need to mention a couple of things:

To do evaluation during training, you need to pass eval_env to fit_online method:

env = gym.make("MountainCarContinuous-v0")
eval_env = gym.make("MountainCarContinuous-v0")
...
ddpg.fit_online(env, buffer, explorer, n_steps=100000, eval_env=eval_env)

ConstantEpsilonGreedy is mainly for discrete action-space tasks. I recommend NormalNoise instead:
https://d3rlpy.readthedocs.io/en/v2.3.0/references/generated/d3rlpy.algos.NormalNoise.html#d3rlpy.algos.NormalNoise
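For example, a minimal sketch (the mean/std arguments here are illustrative; check the linked docs for the exact signature):

# Gaussian exploration noise for continuous actions
explorer = d3rlpy.algos.NormalNoise(mean=0.0, std=0.1)
ddpg.fit_online(env, buffer, explorer, n_steps=100000, eval_env=eval_env)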

Thank you so much for your response!
I have one more thing to ask: to plot the reward, do I need to plot the rollout-return CSV log file, or does the eval_env log anything different?
Thank you again, I really appreciate the effort.

The rollout return is the return collected during the training episodes, so please use the eval return for performance comparison.
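For the plot itself, something along these lines should work (the run directory names and the epoch/step/value CSV layout are assumptions on my side; adjust them to whatever your d3rlpy_logs folder actually contains):

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical run directories under d3rlpy_logs/
runs = {
    "DDPG": "d3rlpy_logs/DDPG_online_1/evaluation.csv",
    "TD3": "d3rlpy_logs/TD3_online_1/evaluation.csv",
    "SAC": "d3rlpy_logs/SAC_online_1/evaluation.csv",
}

for name, path in runs.items():
    # each metric CSV is assumed to hold epoch, step, value rows without a header
    df = pd.read_csv(path, names=["epoch", "step", "value"])
    plt.plot(df["step"], df["value"], label=name)

plt.xlabel("environment steps")
plt.ylabel("evaluation return")
plt.legend()
plt.show()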

Let me close this issue since the issue seems resolved. Feel free to reopen this if there is any further question.