CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)


My loss graph seems weird. (PPO training)

sooftware opened this issue

  • Loss graph (screenshot attached)

I'm trying to train 'examples/summarization_rlhf' with my dataset changed a little bit (the Reward Model trained well, I think), but the Step-3 PPO training loss graph looks as above.

Can policy losses have a negative value?

And are there any expected problems?

My dataset format

  • Prompt
Blah, blah, blah<|sep|>A:Blah, blah, blah.<|sep|>B:
  • Answer
Blah, blah, blah.
  • Reward graph (screenshot attached)

Is it because the average reward is negative?

If the reward is not normalized, then rewards can (and often will) be negative. Can you share your hyperparameters/setup?
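For context, a common way to handle reward scale is to whiten (standardize) rewards before the PPO update. Below is a minimal sketch of that idea; the function name and epsilon value are illustrative assumptions, not trlx's exact API:

```python
import torch

def whiten_rewards(rewards: torch.Tensor) -> torch.Tensor:
    # Standardize rewards to zero mean and unit variance so their raw scale
    # (including a negative average) doesn't dominate the PPO update.
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)  # epsilon guards against zero variance
```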

Yeah, a negative average reward is nothing to worry about as long as it increases, and the policy loss can be negative, since it isn't bounded by zero. If training doesn't achieve the result you expect, first check that your newly trained reward model behaves correctly (did you use a custom dataset for the reward model as well? you can also plug in any other reward model just for testing), and second, tune the hyperparameters to maximize the reward you get, using these instructions: https://github.com/CarperAI/trlx#use-ray-tune-to-launch-hyperparameter-sweep
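To see why the policy loss can be negative, here is a minimal sketch of the standard PPO clipped surrogate loss (a generic illustration, not trlx's exact code; the function name and arguments are assumptions). The loss is the negative of the surrogate objective, so it dips below zero whenever the advantage-weighted ratio is positive on average:

```python
import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_ratio=0.2):
    # Probability ratio between the current policy and the old (rollout) policy.
    ratio = torch.exp(logprobs - old_logprobs)
    # PPO clipped surrogate objective.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    # Loss = minus the surrogate, so positive advantages push it negative.
    return -torch.mean(torch.min(unclipped, clipped))
```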

A normal loss graph usually decreases along a smooth curve. Is this PPO loss graph normal, or is it weird?

Yes, your loss graph looks completely normal. For reference, this is how other PPO loss graphs look in a regular PPO implementation on Atari:

(screenshot of PPO policy loss curves on Atari)

taken from https://github.com/vwxyzjn/cleanrl (run: https://wandb.ai/openrlbenchmark/baselines/runs/1vldj6yx)

Thank you!