ThinamXx / orpo-demo

Working on the implementation of ORPO


Fig a. Comparison of model alignment techniques.

ORPO: Monolithic Preference Optimization without Reference Model

Working on fine-tuning a model using ORPO.

The paper introduces a reference-model-free, monolithic preference alignment method, odds ratio preference optimization (ORPO), by revisiting the role of the supervised fine-tuning (SFT) phase in preference alignment. In the paper's evaluation, ORPO was consistently preferred by the fine-tuned reward model over SFT and RLHF across model scales, and its win rate against DPO increased with model size.
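As a rough illustration of the objective, here is a minimal sketch of the ORPO loss in PyTorch. It assumes you already have length-normalized average per-token log-probabilities of the chosen and rejected responses (`chosen_logps`, `rejected_logps` are hypothetical inputs, not part of this repo's API), and it combines the SFT negative log-likelihood on the chosen response with the odds-ratio term weighted by a coefficient (the paper's lambda):

```python
# Sketch of the ORPO loss (Hong et al., 2024), assuming precomputed
# average per-token log P(y|x) for the chosen and rejected responses.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """chosen_logps / rejected_logps: shape (batch,), values in (-inf, 0)."""
    # log odds(y|x) = log p - log(1 - p), with log(1 - p) = log1p(-exp(log p)).
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: -log sigmoid(log odds ratio) pushes the odds of the
    # chosen response above the odds of the rejected one.
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # SFT term: negative log-likelihood of the chosen response.
    nll = -chosen_logps
    # L_ORPO = L_SFT + lambda * L_OR
    return (nll + lam * l_or).mean()
```

This is a sketch under the stated assumptions, not the repo's implementation; in practice the log-probabilities come from the policy model's logits over the chosen and rejected token sequences.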

Citation

1. Hong, J., Lee, N., & Thorne, J. (2024). ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691. https://arxiv.org/abs/2403.07691

About

License: MIT License


Languages

Python 94.1%, Shell 5.9%