CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)


SFT wrong for Anthropic-HH, leading to poor model quality?

eric-mitchell opened this issue · comments

πŸ› Describe the bug

Please correct me if I'm wrong, but it looks like SFT for Anthropic-HH simply maximizes log p(x) over the entire dialogue history, rather than only maximizing log p(y|x), where x is the dialogue history and y is the final assistant response.

See here, where the concatenation of prompt and response is passed to the trainer.

Since the "chosen" label is only meaningful for the final assistant response, if my interpretation is corrrect, SFT is fine-tuning on mostly bad examples. I observed this after evaluating the pre-trained PPO model model = transformers.AutoModelForCausalLM.from_pretrained('reciprocate/ppo_hh_pythia-6B') using GPT-4 as the proxy human, and found its win rate to be worse than a simple baseline that only fine-tunes on the chosen response (not the whole history).

Which trlX version are you using?

No response

Additional system and package information

No response

commented

Hi! You're right: to fine-tune only on responses, one has to pass a tuple of (prompt, output) instead of a single string, as is done in this script. However, the base model https://huggingface.co/Dahoas/pythia-6B-static-sft used for https://huggingface.co/reciprocate/ppo_hh_pythia-6B was also trained with a masked loss (https://github.com/Dahoas/reward-modeling/blob/main/configs/base_configs/gptneox.yaml), so that's perhaps not the main reason why the model might perform worse than your baseline. Also, have you found empirically that fine-tuning only on chosen responses, rather than on whole samples, is better under GPT-4 eval? We could change the training code here if that's the case.
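For anyone else hitting this, a rough sketch of what the tuple form looks like when calling trlx for SFT. The config helper and exact arguments are assumptions on my part (based on the repo's SFT example scripts), and the sample strings are placeholders; the linked script is the authoritative reference:

```python
import trlx
from trlx.data.default_configs import default_sft_config  # assumed helper, as in the SFT examples

# (prompt, output) pairs; in practice these would be the dialogue history
# and the chosen final assistant response from Anthropic-HH.
samples = [
    (
        "\n\nHuman: How do I bake bread?\n\nAssistant:",
        " Start by mixing flour, water, yeast and salt.",
    ),
]

# Passing pairs instead of single concatenated strings is what switches the
# SFT trainer to the masked, response-only loss described above.
trlx.train(config=default_sft_config(), samples=samples)
```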

It's also worth pointing out that, in their paper, Anthropic claims fine-tuning on both prompt and response is about as good as fine-tuning on the response alone for this dataset. Additionally, the dialogue trees for the helpful dataset are constructed by continuing with the preferred response, so I don't think any fine-tuning is being done on rejected responses.