Any plan for supporting DPO?
lorabit110 opened this issue · comments
Yanan Xie commented
🚀 Feature Request
Support DPO (Direct Preference Optimization) loss and data loader.
Motivation
Many recent open LLMs have achieved promising results using DPO instead of RL-style tuning such as PPO for alignment, and DPO appears to require fewer changes to llm-foundry than full RLHF would.
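For context, the requested loss is compact: DPO optimizes the policy directly on preference pairs, with no reward model or rollouts. A minimal, dependency-free sketch of the per-example loss from the DPO paper (Rafailov et al., 2023) is below; the function name and signature are illustrative, not part of llm-foundry. Inputs are the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each log-ratio compares the policy to the frozen reference model;
    beta controls how far the policy may drift from the reference.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with log1p
    return math.log1p(math.exp(-logits))
```

When the policy equals the reference the loss is log 2 ≈ 0.693; it falls toward 0 as the policy assigns relatively more probability to the chosen response. A data loader for this would need to yield (prompt, chosen, rejected) triples rather than single sequences.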
David Preti commented
same question here