takuseno / d3rlpy

An offline deep reinforcement learning library

Home Page: https://takuseno.github.io/d3rlpy


[QUESTION] Continuously increasing loss and TD error

SuryaPratapSingh37 opened this issue · comments

Hi, I was just getting started with this amazing d3rlpy library and wanted to train a very simple policy with DQN on the CartPole environment. But I'm not sure why the loss and the TD errors (both validation and training) keep increasing. I tried increasing n_steps and n_steps_per_epoch, but with no success. Even if it had been over-fitting, at least the loss and the training TD error should have been decreasing. Can you please help?

Attaching the code & plots below

import d3rlpy
from d3rlpy.datasets import get_cartpole # CartPole-v0 dataset
from d3rlpy.datasets import get_pendulum # Pendulum-v0 dataset
# from d3rlpy.datasets import get_pybullet # PyBullet task datasets
from d3rlpy.datasets import get_atari    # Atari 2600 task datasets
from d3rlpy.dataset import create_infinite_replay_buffer
from d3rlpy.algos import DQNConfig
import matplotlib.pyplot as plt
import pandas as pd
dataset, env = get_cartpole()

from sklearn.model_selection import train_test_split

train_episodes, test_episodes = train_test_split(dataset.episodes, test_size=0.2)
train_dataset = create_infinite_replay_buffer(episodes=train_episodes)

from d3rlpy.algos import DQN

dqn = DQNConfig().create()

# Track validation TD error
val_td_scorer = d3rlpy.metrics.TDErrorEvaluator(episodes=test_episodes)
# Track training TD error
train_td_scorer = d3rlpy.metrics.TDErrorEvaluator()

dqn.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={"val_td_scorer": val_td_scorer, "train_td_scorer": train_td_scorer},
)

# d3rlpy writes header-less CSV logs; the columns are (epoch, step, value)
df = pd.read_csv('d3rlpy_logs/DQN_20240418233428/loss.csv', header=None)
df2 = pd.read_csv('d3rlpy_logs/DQN_20240418233428/val_td_scorer.csv', header=None)
df3 = pd.read_csv('d3rlpy_logs/DQN_20240418233428/train_td_scorer.csv', header=None)

plt.plot(df[1], df[2])
plt.plot(df2[1], df2[2])
plt.plot(df3[1], df3[2])
plt.legend(['loss', 'val_td_scorer', 'train_td_scorer'])
plt.show()

[Plot: cartpole_train_results — loss, val_td_scorer, and train_td_scorer over training steps, all increasing]

@SuryaPratapSingh37 Hi, thanks for the issue. Generally speaking, deep RL training does not usually converge because of its nonstationary nature, so this is expected. Also, the TD error is not a very good metric; people don't usually use it to measure training progress.

@takuseno Thanks for your reply. Could you please guide me on exactly how I should change the above code to make it converge? I thought CartPole was a pretty simple environment, so the loss should at least have been decreasing (if not overfitting). Secondly, if not the TD error, what else should I use here to examine training (to find out whether it's overfitting or not)?

Sadly, particularly in offline deep RL, it's very difficult to prevent divergence, so my recommendation is not to expect convergence. Also, in offline RL, there are no good metrics to measure policy performance yet. I'd direct you to this documentation and a paper:

Offline deep RL still needs a lot of innovation to make it practical 😓
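
One practical sanity check in this toy setting, since get_cartpole() also returns the live environment: track the online episode return during training instead of the TD error. A minimal sketch, assuming d3rlpy v2's EnvironmentEvaluator (note that it runs actual rollouts, so it is not a purely offline metric):

# Roll out the current greedy policy in the real environment at the end of
# each epoch and log the average episode return, which is much easier to
# interpret than the TD error.
env_scorer = d3rlpy.metrics.EnvironmentEvaluator(env)

dqn.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={"environment": env_scorer},
)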

Ohh... do you know whether the transitions are sampled randomly from the replay buffer during training (and if not, how can I randomly shuffle the transitions)?

Yes, the mini-batch is uniformly sampled from the buffer.
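
For illustration only (plain NumPy, not d3rlpy internals): uniform sampling means every stored transition has the same probability of being drawn at every gradient step, with replacement, so no manual shuffling is needed.

import numpy as np

rng = np.random.default_rng(0)
buffer_size = 10000   # hypothetical number of stored transitions
batch_size = 32

# Indices of the transitions that would form one mini-batch.
batch_indices = rng.integers(0, buffer_size, size=batch_size)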

Another suggestion to prevent divergence is to use an offline RL algorithm. Right now you're using DQN, which is designed for online training. If you use DiscreteCQL instead, you might get better results in the offline setting, as in the sketch below.
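
Swapping it in is a small change to the snippet above; a sketch assuming d3rlpy v2's DiscreteCQLConfig, with the rest of the training call unchanged:

from d3rlpy.algos import DiscreteCQLConfig

# Conservative Q-Learning for discrete actions; exposes the same fit() interface as DQN.
cql = DiscreteCQLConfig().create()
cql.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={"val_td_scorer": val_td_scorer, "train_td_scorer": train_td_scorer},
)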

Please let me close this issue, since this is simply the nature of offline RL.