YuxiXie / MCTS-DPO

This repository contains the source code for Self-Evaluation Guided MCTS for online DPO.

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

This repository contains code and analysis for the paper: Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning. Below is the framework of our proposed method.

Model Framework

Environment Setup

conda env create --file conda-recipe.yaml
pip install -r requirements.txt
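
Activate the environment before running the scripts below. The environment name here is an assumption; use the name defined in the name: field of conda-recipe.yaml:

conda activate mcts-dpo  # hypothetical name; check conda-recipe.yaml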

Run MCTS-DPO

Our main code includes ./mcts_rl/algorithms/mcts and ./mcts_rl/trainers/tsrl_trainer.py.

To run MCTS-DPO for MathQA on Mistral (SFT):

bash scripts/mcts_mathqa.sh

To run MCTS-DPO for CSR (commonsense reasoning) on Mistral (SFT):

bash scripts/mcts_csr.sh
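
For intuition, the sketch below (not the repository's implementation; all function names, field names, and numbers are hypothetical) shows the core idea: sibling reasoning steps proposed during an MCTS expansion are ranked by their value estimates, the best and worst form a (chosen, rejected) pair, and the pair is scored with a DPO-style loss.

# Conceptual sketch only -- illustrative, not the code in mcts_rl/.
import math

def build_preference_pair(children):
    """Pair the highest- and lowest-value sibling steps from one MCTS expansion."""
    ranked = sorted(children, key=lambda c: c["value"], reverse=True)
    return (ranked[0], ranked[-1]) if len(ranked) >= 2 else None

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Single-pair DPO objective: -log sigmoid(beta * (log-ratio margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

if __name__ == "__main__":
    # Toy sibling nodes (candidate next reasoning steps) with MCTS value estimates.
    children = [
        {"step": "3 * 4 = 12, so the total is 12", "value": 0.8},
        {"step": "3 + 4 = 7, so the total is 7", "value": 0.2},
    ]
    chosen, rejected = build_preference_pair(children)
    # Log-probabilities would come from the policy and the frozen reference model.
    loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-6.5,
                    ref_logp_chosen=-5.5, ref_logp_rejected=-6.0)
    print(f"chosen: {chosen['step']}\nrejected: {rejected['step']}\nloss: {loss:.4f}")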

Citation

@article{xie2024monte,
  title={Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning},
  author={Xie, Yuxi and Goyal, Anirudh and Zheng, Wenyue and Kan, Min-Yen and Lillicrap, Timothy P and Kawaguchi, Kenji and Shieh, Michael},
  journal={arXiv preprint arXiv:2405.00451},
  year={2024}
}

This repository is adapted from the codebase of Safe-RLHF.
