longday1102 / OVM

⚡ OVM for Planning in Mathematical Reasoning

Outcome-supervised Value Model

I. Introduction

  • This repository is my from-scratch implementation of the main idea of the paper Outcome-supervised Verifiers for Planning in Mathematical Reasoning.
  • Large language models (LLMs) often struggle to maintain accuracy across a sequence of intermediate reasoning steps in mathematical reasoning, and the resulting error propagation undermines the final answer. As the authors note, the current approach to mitigating this issue is to use a verifier model to assess the correctness of generated solution candidates, focusing either on the overall reasoning path or on an incomplete reasoning path.
  • In this repository I focus on the incomplete-reasoning-path setting, which the authors report outperforms the overall-reasoning-path setting. My target task is answering Vietnamese elementary math questions (motivation: the Zalo AI Challenge 2023 exam, track ELEMENTARY MATHS SOLVING).

II. Method

  • For the overall reasoning path, the verifier scores the complete path (this is the Outcome-supervised Reward Model, ORM).

  • For an incomplete reasoning path, the verifier scores the intermediate steps, i.e. it verifies partial paths (this is the Outcome-supervised Value Model, OVM).

  • In the paper, the authors show that OVM outperforms ORM.

  • Training the OVM has two stages: training the generator (an LLM), then training the verifier (the generator with a linear layer appended).

  • When training the verifier, I freeze all the weights of the generator and train only the linear layer.
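The frozen-generator / trainable-head setup above can be sketched in a toy, stdlib-only form. Here the "generator" is a hypothetical stand-in that maps a solution prefix to fixed features (in the real repo it is the LLM's hidden state), and only the linear value head `(w, b)` is updated toward the outcome label; all names are illustrative, not the repo's.

```python
def frozen_features(prefix):
    # Stand-in for the FROZEN generator: maps a solution prefix to features.
    # (In the real setup this is the LLM's last hidden state.)
    return [len(prefix) / 100.0, prefix.count("=") / 5.0]

def value_head(feats, w, b):
    # The only trainable part: a linear layer producing a scalar value.
    return sum(f * wi for f, wi in zip(feats, w)) + b

def train_value_head(data, w, b, lr=0.5, epochs=300):
    # Plain SGD on squared error against the outcome label
    # (1.0 = the path leads to a correct final answer, 0.0 = it does not).
    # Note: frozen_features is never updated, only w and b change.
    for _ in range(epochs):
        for prefix, label in data:
            feats = frozen_features(prefix)
            err = value_head(feats, w, b) - label
            w = [wi - lr * err * f for wi, f in zip(w, feats)]
            b = b - lr * err
    return w, b
```

This mirrors the training split described above: stage 1 produces the generator, stage 2 fits only the appended linear layer on outcome labels.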

III. Dataset

  • The dataset I use to train the generator is 7,500 samples of the GSM8K dataset, translated into Vietnamese.
  • The dataset I use to train the verifier is 375,000 samples produced by the trained generator: for each of the 7,500 translated GSM8K training samples, the generator produces 50 candidate solutions.
  • The GSM8K dataset was translated entirely with Google Translate, so its quality is limited. For that reason this repo only demonstrates the training and testing method; when I find a high-quality Vietnamese dataset, I will retrain and re-evaluate.
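The verifier-data construction above (7,500 questions × 50 candidates = 375,000 labeled samples) can be sketched as follows. This is a minimal illustration, not the repo's code: `generate` stands in for the trained generator, and `extract_answer` assumes GSM8K's `#### <answer>` final-answer convention.

```python
import re

def extract_answer(solution):
    # GSM8K-style solutions end with "#### <final answer>".
    m = re.search(r"####\s*(\S+)", solution)
    return m.group(1) if m else None

def build_verifier_data(questions, generate, k=50):
    # Sample k candidate solutions per question and label each candidate by
    # final-answer correctness only (outcome supervision): no per-step labels.
    data = []
    for question, gold in questions:
        for _ in range(k):
            candidate = generate(question)
            label = 1.0 if extract_answer(candidate) == gold else 0.0
            data.append((question, candidate, label))
    return data
```

The resulting `(question, candidate, label)` triples are what the linear value head is trained on.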

IV. Model

The base model I use is Mistral 7B. Due to limited computational resources, I train the generator with 4-bit QLoRA.
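A 4-bit QLoRA setup for Mistral 7B typically looks like the sketch below, using Hugging Face `transformers` and `peft`. The hyperparameters (rank, alpha, target modules) are illustrative assumptions, not necessarily the repo's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4-bit NormalFloat, compute in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters; the 4-bit base stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Only the adapter weights are updated during generator training, which is what makes 7B-scale fine-tuning feasible on limited hardware.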

V. Test

Run all cells in inference.ipynb
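At inference time, the OVM is used for planning: at each step, expand several continuations of every partial path and keep only the highest-value prefixes. The sketch below is a generic step-level beam search, not the notebook's code; `expand` (a hypothetical step generator) and `value_fn` (the OVM's predicted probability of reaching a correct answer) are assumed interfaces.

```python
def value_guided_search(question, expand, value_fn, beam_width=2, max_steps=8):
    # Step-level beam search guided by the OVM: at every step, generate
    # candidate next steps for each partial path, then keep only the
    # top-`beam_width` prefixes by predicted value.
    beams = [[]]  # start from the empty reasoning path
    for _ in range(max_steps):
        candidates = [path + [step]
                      for path in beams
                      for step in expand(question, path)]
        if not candidates:
            break  # every surviving path is complete
        candidates.sort(key=value_fn, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # highest-value path found
```

This is where verifying incomplete paths pays off: bad branches are pruned mid-solution instead of only being rejected after a full solution is generated.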

This paper presents a very good idea; my repository implements its main idea in a simpler way (focusing mainly on the ideas). You can find the authors' original repo here.
I hope this method opens up many new directions and ideas for reasoning problems. Many thanks to the authors for this finding! 😊
