takuseno / d3rlpy

An offline deep reinforcement learning library

Home Page: https://takuseno.github.io/d3rlpy


[QUESTION] Continuously increasing loss and TD error

SuryaPratapSingh37 opened this issue · comments

Hi, I was just getting started with this amazing d3rlpy library and wanted to train a very simple policy with DQN on the CartPole environment. But I'm not sure why the loss and the TD errors (both validation and training) keep increasing. I tried increasing n_steps and n_steps_per_epoch, but with no success. Even if it had been over-fitting, at least the loss and the training TD error should have been decreasing. Can you please help?

Attaching the code & plots below

import d3rlpy
from d3rlpy.datasets import get_cartpole # CartPole-v0 dataset
from d3rlpy.datasets import get_pendulum # Pendulum-v0 dataset
# from d3rlpy.datasets import get_pybullet # PyBullet task datasets
from d3rlpy.datasets import get_atari    # Atari 2600 task datasets
from d3rlpy.dataset import create_infinite_replay_buffer
from d3rlpy.algos import DQNConfig
import matplotlib.pyplot as plt
import pandas as pd
dataset, env = get_cartpole()

from sklearn.model_selection import train_test_split

train_episodes, test_episodes = train_test_split(dataset.episodes, test_size=0.2)
train_dataset = create_infinite_replay_buffer(episodes=train_episodes)

from d3rlpy.algos import DQN

dqn = DQNConfig().create()

# Track validation TD error
val_td_scorer = d3rlpy.metrics.TDErrorEvaluator(episodes=test_episodes)
# Track training TD error
train_td_scorer = d3rlpy.metrics.TDErrorEvaluator()

dqn.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={"val_td_scorer": val_td_scorer, "train_td_scorer": train_td_scorer},
)

# d3rlpy writes header-less CSV logs; the columns are (epoch, step, value)
df = pd.read_csv('d3rlpy_logs/DQN_20240418233428/loss.csv', header=None)
df2 = pd.read_csv('d3rlpy_logs/DQN_20240418233428/val_td_scorer.csv', header=None)
df3 = pd.read_csv('d3rlpy_logs/DQN_20240418233428/train_td_scorer.csv', header=None)

plt.plot(df[1], df[2])
plt.plot(df2[1], df2[2])
plt.plot(df3[1], df3[2])
plt.legend(['loss', 'val_td_scorer', 'train_td_scorer'])
plt.show()

[Plot: cartpole_train_results — loss, val_td_scorer, and train_td_scorer over training steps, all increasing]

@SuryaPratapSingh37 Hi, thanks for the issue. Generally speaking, deep RL training does not usually converge because of its nonstationary nature, so this is expected. Also, the TD error is not a very good metric; people don't usually use it to measure training progress.

@takuseno Thanks for your reply. Could you please guide me on exactly how I should change the above code to make it converge? I thought CartPole was a pretty simple environment, so the loss should at least have been decreasing (if not overfitting). Secondly, if not the TD error, what else should I use here to examine training (to find out whether it's overfitting or not)?

Sadly, particularly in offline deep RL, it's very difficult to prevent divergence, so my recommendation is not to expect convergence. Also, in offline RL, there are no good metrics to measure policy performance yet. I'd direct you to this documentation and a paper:

Offline deep RL still needs a lot of innovation to make it practical 😓
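
One practical sanity check in this toy setting, since get_cartpole() also returns the live environment: track the online episode return during training instead of the TD error. A minimal sketch, assuming d3rlpy v2's EnvironmentEvaluator (note that it runs actual rollouts, so it is not a purely offline metric):

# Roll out the current greedy policy in the real environment at the end of
# each epoch and log the average episode return, which is much easier to
# interpret than the TD error.
env_scorer = d3rlpy.metrics.EnvironmentEvaluator(env)

dqn.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={"environment": env_scorer},
)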

Ohh... do you know whether the transitions are sampled randomly from the replay buffer during training (and if not, how can I randomly shuffle the transitions)?

Yes, the mini-batch is uniformly sampled from the buffer.
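
For illustration only (plain NumPy, not d3rlpy internals): uniform sampling means every stored transition has the same probability of being drawn at every gradient step, with replacement, so no manual shuffling is needed.

import numpy as np

rng = np.random.default_rng(0)
buffer_size = 10000   # hypothetical number of stored transitions
batch_size = 32

# Indices of the transitions that would form one mini-batch.
batch_indices = rng.integers(0, buffer_size, size=batch_size)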

Another suggestion to prevent divergence is to use an offline RL algorithm. Right now you're using DQN, which is designed for online training. If you use DiscreteCQL instead, you might get better results in the offline setting, as in the sketch below.
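
Swapping it in is a small change to the snippet above; a sketch assuming d3rlpy v2's DiscreteCQLConfig, with the rest of the training call unchanged:

from d3rlpy.algos import DiscreteCQLConfig

# Conservative Q-Learning for discrete actions; exposes the same fit() interface as DQN.
cql = DiscreteCQLConfig().create()
cql.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={"val_td_scorer": val_td_scorer, "train_td_scorer": train_td_scorer},
)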

Please let me close this issue, since this is simply the nature of offline RL.