szalouk / rlhf-bias

Measuring and Reducing Bias in LLMs introduced by RLHF

CS224R Final Project

RLHF Training

There are three main steps to the RLHF training process:

  1. Supervised fine-tuning of the base LLM to create the SFT LLM:
    • ./scripts/supervised_finetuning.sh <SFT_MODEL_NAME>
  2. Reward modeling on dialog pairs from the StackExchange dataset, using the SFT LLM to create the RM:
    • ./scripts/reward_modeling.sh <RM_MODEL_NAME>
  3. RL fine-tuning of the SFT LLM with the reward model (a rough sketch of this step follows the list):
    • ./scripts/rl_training.sh <SFT_MODEL_NAME> <RM_MODEL_NAME> <NUM_TRAINING_EXAMPLES>
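The RL step is a standard PPO loop over prompts, generations, and scalar rewards. As an illustration of this kind of RL fine-tuning (not the repo's exact script), here is a rough sketch using the older Hugging Face TRL PPOTrainer API (trl <= 0.11); the model name, prompt dataset, and stand-in reward model below are placeholders, not the actual SFT model, StackExchange prompts, or trained RM.

# Illustrative PPO loop with Hugging Face TRL (not the repo's exact code).
# "gpt2", the IMDB prompts, and the sentiment-classifier reward are placeholders
# standing in for the SFT model, StackExchange prompts, and the trained RM.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=16, mini_batch_size=4)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)      # policy (SFT init)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen reference for the KL penalty

# Placeholder prompts; the real pipeline builds queries from StackExchange questions.
dataset = load_dataset("imdb", split="train[:256]")
dataset = dataset.map(lambda x: {"input_ids": tokenizer(x["text"][:200], truncation=True)["input_ids"]})
dataset.set_format("torch")

# Placeholder reward: a sentiment classifier standing in for the trained reward model.
reward_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator)
gen_kwargs = {"max_new_tokens": 32, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

for batch in ppo_trainer.dataloader:
    queries = batch["input_ids"]
    # Generate a response for each query, keeping only the newly generated tokens.
    responses = [ppo_trainer.generate(q, **gen_kwargs).squeeze()[len(q):] for q in queries]
    texts = tokenizer.batch_decode([torch.cat([q, r]) for q, r in zip(queries, responses)])
    rewards = [torch.tensor(out["score"]) for out in reward_pipe(texts)]  # one scalar reward per sample
    ppo_trainer.step(queries, responses, rewards)  # PPO update with KL penalty against ref_model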

LoRA layers were used at all stages to reduce memory requirements. At each stage, the PEFT adapter layers were merged back into the base model using:

python merge_peft_adapter.py --adapter_model_name=XXX --base_model_name=YYY --output_name=ZZZ

Note that this script requires peft>=0.3.0.
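For reference, the merge does roughly the following. Below is a minimal sketch with the peft API, where XXX/YYY/ZZZ mirror the placeholder arguments above (the repo's merge_peft_adapter.py may differ in detail):

# Minimal sketch of merging a LoRA adapter into its base model with peft
# (placeholder names mirror the XXX/YYY/ZZZ arguments above).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_model_name = "XXX"   # trained LoRA adapter
base_model_name = "YYY"      # base model the adapter was trained on
output_name = "ZZZ"          # directory for the merged model

base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_model_name)
model = model.merge_and_unload()  # fold the LoRA weights into the base weights

model.save_pretrained(output_name)
AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_name)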

Evaluation

Bias

To evaluate the bias of fine-tuned and debiased GPT-Neo models, run:

python self_debiasing.py --models <MODEL_1> <MODEL_2> ... --modes default debiased

For LLaMA models, run:

python self_debiasing_llama.py --models <MODEL_1> <MODEL_2> ... --modes default debiased
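The debiased mode is based on self-debiased decoding (Schick et al., 2021), which compares the next-token distribution against the one obtained when a bias-encouraging prefix is prepended and suppresses tokens that the prefix makes more likely. A minimal sketch of that reweighting follows; the decay constant and prompt wording here are illustrative, not necessarily the repo's exact settings.

# Sketch of the self-debiasing reweighting (Schick et al., 2021); the decay
# constant and the bias-encouraging prefix are illustrative values only.
import torch

def debias_next_token_probs(logits_plain: torch.Tensor,
                            logits_biased: torch.Tensor,
                            decay: float = 50.0) -> torch.Tensor:
    """Rescale next-token probabilities using a bias-encouraging prompt.

    logits_plain:  next-token logits for the original input.
    logits_biased: next-token logits for the same input with a prefix such as
                   "The following text contains a sexist remark:" prepended.
    """
    p_plain = torch.softmax(logits_plain, dim=-1)
    p_biased = torch.softmax(logits_biased, dim=-1)
    delta = p_plain - p_biased  # negative where the biased prefix boosts a token
    # Keep tokens the prefix does not favor; exponentially suppress the rest.
    alpha = torch.where(delta >= 0, torch.ones_like(delta), torch.exp(decay * delta))
    p_debiased = alpha * p_plain
    return p_debiased / p_debiased.sum(dim=-1, keepdim=True)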

Perplexity

To evaluate the perplexity of fine-tuned and debiased models, run:

python eval_perplexity.py --models <MODEL_1> <MODEL_2> ... --modes default debiased
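This amounts to averaging the language-modeling loss over an evaluation corpus and exponentiating it. A minimal sketch (the model checkpoint and texts are placeholders; eval_perplexity.py may batch and weight tokens differently):

# Simple per-example perplexity sketch (placeholder model and texts;
# the repo's eval_perplexity.py may differ in batching and weighting).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_perplexity(model_name: str, texts: list[str]) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            # With labels == input_ids, the model returns the mean next-token NLL.
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    # Unweighted average over examples; exponentiate to get perplexity.
    return math.exp(sum(losses) / len(losses))

print(mean_perplexity("EleutherAI/gpt-neo-125M", ["The quick brown fox jumps over the lazy dog."]))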
