CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)


My loss graph seems weird. (PPO training)

sooftware opened this issue

  • Loss graph (screenshot attached)

I'm trying to train 'examples/summarization_rlhf' with my dataset changed a little bit (the Reward Model trained well, I think), but the Step-3 PPO training loss graph looks as above.

Can policy losses have a negative value?

And are there any expected problems?

My dataset format

  • Prompt
Blah, blah, blah<|sep|>A:Blah, blah, blah.<|sep|>B:
  • Answer
Blah, blah, blah.
  • Reward graph (screenshot attached)

Is it because the average reward is negative?

If the reward is not normalized, then rewards can (and often will) be negative. Can you share your hyperparameters/setup?
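For context, a common way to handle reward scale is to whiten (standardize) rewards before the PPO update. Below is a minimal sketch of that idea; the function name and epsilon value are illustrative assumptions, not trlx's exact API:

```python
import torch

def whiten_rewards(rewards: torch.Tensor) -> torch.Tensor:
    # Standardize rewards to zero mean and unit variance so their raw scale
    # (including a negative average) doesn't dominate the PPO update.
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)  # epsilon guards against zero variance
```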

Yeah, a negative average reward is nothing to worry about as long as it increases, and the policy loss can be negative, since it isn't bounded by zero. If training doesn't achieve the result you expect, first check that your newly trained reward model behaves correctly (did you use a custom dataset for the reward model as well? you can also plug in any other reward model just for testing), and second, tune the hyperparameters to maximize the reward you get, using these instructions: https://github.com/CarperAI/trlx#use-ray-tune-to-launch-hyperparameter-sweep
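To see why the policy loss can be negative, here is a minimal sketch of the standard PPO clipped surrogate loss (a generic illustration, not trlx's exact code; the function name and arguments are assumptions). The loss is the negative of the surrogate objective, so it dips below zero whenever the advantage-weighted ratio is positive on average:

```python
import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_ratio=0.2):
    # Probability ratio between the current policy and the old (rollout) policy.
    ratio = torch.exp(logprobs - old_logprobs)
    # PPO clipped surrogate objective.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    # Loss = minus the surrogate, so positive advantages push it negative.
    return -torch.mean(torch.min(unclipped, clipped))
```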

A normal loss graph usually decreases along a smooth curve. Is this PPO loss graph normal, or is it weird?

Yes, your loss graph looks completely normal. For reference, this is how other PPO loss graphs look in a regular PPO implementation on Atari:

(screenshot of PPO policy loss curves on Atari)

taken from https://github.com/vwxyzjn/cleanrl (run: https://wandb.ai/openrlbenchmark/baselines/runs/1vldj6yx)

Thank you!