This project provides a Gradio UI for collecting human preference labels on generated text. The resulting feedback can be used to train a reward model for RLHF.
There are two files: `app.py` for basic usage and `advanced_app.py` for advanced usage. Both are heavily inspired by Anthropic's "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" paper.
NOTE: the basic version is fully functional except that you have to fill in the `record` function for your specific use case, i.e. how you would like to handle clicks on chosen preferences. The advanced version, however, is not a fully functioning application; it provides only the UI.
The basic version is demonstrated with the Flan Alpaca model. All you need to do in `app.py` is:

- replace the `model` variable with your own model
- replace the `GenerationConfig` with your own settings
- complete the `record()` function:
  - each choice on `A` and `B` is scored between 1 and 4
  - do whatever you want with the scores
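As one way to complete the steps above, here is a minimal sketch of what a `record()` implementation might look like. It assumes the UI hands you the prompt, both responses, and the two 1–4 scores; the JSONL file name and the chosen/rejected convention are hypothetical choices for reward-model training data, not part of this project.

```python
import json
from pathlib import Path

LOG_PATH = Path("preferences.jsonl")  # hypothetical output file

def record(prompt, response_a, response_b, score_a, score_b):
    """Append one labeled comparison to a JSONL file.

    Each score is an integer from 1 to 4, as produced by the UI.
    Splitting into chosen/rejected is one possible convention for
    building reward-model training pairs.
    """
    entry = {
        "prompt": prompt,
        "chosen": response_a if score_a >= score_b else response_b,
        "rejected": response_b if score_a >= score_b else response_a,
        "score_a": score_a,
        "score_b": score_b,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Swapping in your own model means replacing the `model` variable and `GenerationConfig` in `app.py`; the logging above is independent of which model produced the responses.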