CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

Reward model negative numbers meaning

GenVr opened this issue

GenVr commented

Hi everyone,
I'm training a GPT-J model on my own task (a language generation task) following the summarize_rlhf code. I completed step 1 and the network works fine. Then, for step 2, I created my dataset with prompt, chosen, and rejected rows. When training the reward model, I get a lot of negative values as scores. I also see negative scores for the ground-truth inputs, something like:

Input               |  [...]  |  score_pred  |  score_truth
MyString            |  [...]  |     -5.512   |    -5.926

Sometimes it gives positive results, but this seems a bit strange. I followed the code provided in that folder for inference.
I also performed step 3 with this reward model, and the final network got worse.

Is it normal for the reward model to give these negative scores? Even on the ground truth?

Thanks.

Actually, the reward model is trained like f(x) = r(x) - r(x_ref), where x_ref is the gold summary. Here, r(x_ref) is expected to be higher than r(x), so the final reward can come out negative. See the later section of the Summarization article.
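
For reference, a minimal sketch of that kind of normalization at inference time, computing f(x) = r(x) - r(x_ref) against the gold summary. The reward_model interface, the score helper, and the variable names here are assumptions for illustration, not the actual summarize_rlhf code:

```python
import torch

def normalized_reward(reward_model, tokenizer, prompt, completion, ref_completion, device="cpu"):
    """Score a completion relative to the gold/reference completion.

    Assumes `reward_model` maps tokenized (prompt + completion) text to a
    single scalar score; this is a hypothetical interface, not the exact
    summarize_rlhf API.
    """
    def score(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        with torch.no_grad():
            return reward_model(**inputs).item()

    r_x = score(prompt + completion)        # r(x)
    r_ref = score(prompt + ref_completion)  # r(x_ref), the gold summary
    return r_x - r_ref                      # f(x) = r(x) - r(x_ref); often negative
```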

commented

It should be normal for the reward to take on negative values, and for its distribution not to be centered around zero, since only the ordering of the scores matters for the loss.
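
For what it's worth, here is a minimal sketch of the standard pairwise ranking loss used for reward models (the exact loss in summarize_rlhf may differ in detail). Only the difference between the chosen and rejected scores enters the loss, so shifting every score by a constant changes nothing, and the sign of the raw scores carries no meaning. The score values below are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Standard pairwise ranking loss for reward models:
#   L = -log sigmoid(r_chosen - r_rejected)
def pairwise_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([-5.1, -4.8, -6.0])
r_rejected = torch.tensor([-5.9, -5.3, -6.4])

shift = 100.0  # arbitrary constant offset added to every score
print(pairwise_loss(r_chosen, r_rejected))
print(pairwise_loss(r_chosen + shift, r_rejected + shift))  # same value: the shift cancels out
```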

@sohamdats this is incorrect: the reward model was not trained in this manner; its scores were only normalized like this during inference. Still, the final reward can be negative regardless of whether normalization was applied.
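
In practice, a more telling sanity check for a trained reward model than the sign of its scores is the fraction of held-out pairs where the chosen completion outscores the rejected one. A minimal sketch, with made-up scores and a hypothetical helper name:

```python
import torch

def pairwise_accuracy(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> float:
    """Fraction of held-out pairs where the chosen completion is ranked
    above the rejected one. Around 0.5 means the model learned nothing;
    the absolute sign of the scores is irrelevant."""
    return (chosen_scores > rejected_scores).float().mean().item()

# Made-up scores: all negative, yet the ranking is mostly correct.
chosen = torch.tensor([-5.1, -4.8, -6.0, -5.5])
rejected = torch.tensor([-5.9, -5.3, -6.4, -5.2])
print(pairwise_accuracy(chosen, rejected))  # 0.75
```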