LuJunru / SamPO

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

SamPO

We provide code and models for SamPO in this repository. Please refer to the paper for details: Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence. In short, the sequence-level KL divergence terms that DPO computes over the chosen and rejected sequences depend on token length, so the implicit rewards are over- or underestimated when the two responses differ in length, which leads to the verbosity issue. We address this with a simple and effective down-sampling regularization, named SamPO.

Environment

We provide a `requirements.txt` for your convenience.

Key Difference between SamPO and DPO

For a quick check of the key difference between the two objectives, see the sketch below.
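
The crux of SamPO is how the per-token policy/reference log-ratios are aggregated before entering the DPO objective: vanilla DPO sums them over the full (length-dependent) sequences, while SamPO down-samples an equal number of tokens from the chosen and rejected responses. Below is a minimal, loop-based sketch of that difference, assuming per-token log-probabilities and response masks have already been computed; the function names and the exact sampling details are illustrative and may not match the repository's implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    chosen_mask, rejected_mask, beta=0.1, sampo=True):
    """All *_logps are per-token log-probs of shape (batch, seq_len);
    masks are 1 for response tokens and 0 for prompt/padding."""
    # Per-token log-ratios between policy and reference.
    chosen_ratios = (policy_chosen_logps - ref_chosen_logps) * chosen_mask
    rejected_ratios = (policy_rejected_logps - ref_rejected_logps) * rejected_mask

    if not sampo:
        # Vanilla DPO: sum over the full sequences (length-dependent).
        chosen_term = chosen_ratios.sum(-1)
        rejected_term = rejected_ratios.sum(-1)
    else:
        # SamPO: randomly down-sample the same number of tokens from both
        # sides, so sequence length no longer skews the implicit reward.
        chosen_term, rejected_term = [], []
        for i in range(chosen_ratios.size(0)):
            c_idx = chosen_mask[i].nonzero(as_tuple=True)[0]
            r_idx = rejected_mask[i].nonzero(as_tuple=True)[0]
            k = min(c_idx.numel(), r_idx.numel())
            c_sub = c_idx[torch.randperm(c_idx.numel(), device=c_idx.device)[:k]]
            r_sub = r_idx[torch.randperm(r_idx.numel(), device=r_idx.device)[:k]]
            chosen_term.append(chosen_ratios[i, c_sub].sum())
            rejected_term.append(rejected_ratios[i, r_sub].sum())
        chosen_term = torch.stack(chosen_term)
        rejected_term = torch.stack(rejected_term)

    logits = chosen_term - rejected_term
    return -F.logsigmoid(beta * logits).mean()
```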

Fine-tuning

Run `bash tasks.sh` to fine-tune with DPO and all of its variants, including our SamPO.

Evaluation

For five conditional benchmarks, we use lm-evaluation-harness (a usage sketch follows the list below):

  • GSM8K: 8-shot, report strict match
  • IFEval: 3-shot, report instruction-level strict accuracy
  • PiQA: 3-shot, report accuracy
  • MMLU: 0-shot, report normalized accuracy
  • TruthfulQA: 3-shot, report accuracy of single-true mc1 setting
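
As a usage example, the harness can be driven from Python roughly as follows. This is a sketch assuming a recent lm-evaluation-harness (0.4.x); task names, few-shot counts, and metric keys may differ across versions, and the checkpoint path is a placeholder.

```python
import lm_eval

# Placeholder path: point this at the fine-tuned checkpoint you want to evaluate.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./outputs/llama-3-8b-instruct-sampo,dtype=bfloat16",
    tasks=["gsm8k"],   # e.g. GSM8K, evaluated 8-shot as reported above
    num_fewshot=8,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```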

For AlpacaEval2, we use the official alpaca_eval:

  • AlpacaEval2: win rate (%)
  • LC AlpacaEval2: length-debiased win rate (%) of AlpacaEval2

For HH-RLHF & TL;DR, we use the same GPT-4 win-rate prompt template proposed in the DPO paper:

  • Win rate (%): win rate of the fine-tuned models against the SFT baseline

Model Weights & Performance

  • Pythia-2.8B-HH-RLHF-Iterative-SamPO: HF Link
  • Pythia-2.8B-TLDR-Iterative-SamPO: HF Link
  • Llama-3-8B-Instruct-Iterative-SamPO: HF Link

Note: the test sets of HH-RLHF and TL;DR are also released via the links above.
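
The released checkpoints can be loaded with standard transformers tooling. A minimal sketch follows; the model id below is a placeholder, so substitute the actual repository id from the HF Link above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Llama-3-8B-Instruct-Iterative-SamPO"  # placeholder: use the real HF repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the SamPO idea in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```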

Acknowledgment

This code is built upon the TRL repository.

Citation

@article{LUandLI2024SamPO,
  title={Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence},
  author={Lu, Junru and Li, Jiazheng and An, Siyu and Zhao, Meng and He, Yulan and Yin, Di and Sun, Xing},
  journal={arXiv preprint arXiv:2406.10957},
  year={2024}
}


License

MIT License

