lucidrains / PaLM-rlhf-pytorch

Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM


Unified reward function/model architecture for a wide range of tasks

James4Ever0 opened this issue

I find the reward function to be the most important part of RLHF, because it is the part that mimics a human evaluator and provides instant feedback to the model.

However, given ChatGPT's wide range of language capabilities, it is hard to build such a reward function as a single model that is prompt-dependent, context-aware, and leverages the knowledge already present in pretrained models.

Most RLHF-related projects use toy-like reward functions such as counting word frequencies, checking output formats, or plain sentiment/fluency scores. These functions do not "think" like a human evaluator, who weighs every factor as a whole. RL4LMs proposes GRUE, in which the model follows general instructions, but it does not expose a simple unified interface that returns a score given a prompt and an answer.
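For illustration, here is a minimal sketch of the kind of "toy" reward function I mean, a keyword-frequency score behind a `reward(prompt, answer)` interface. The function and keyword list are hypothetical, not taken from RL4LMs or this repo:

```python
# A toy reward: count keyword occurrences in the answer and length-normalize.
# This is exactly the kind of heuristic that does not "think" like a human evaluator.
from collections import Counter

def toy_keyword_reward(prompt: str, answer: str, keywords=("helpful", "sorry")) -> float:
    """Score an answer by keyword frequency (hypothetical example)."""
    tokens = answer.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[k] for k in keywords)
    return hits / len(tokens)

print(toy_keyword_reward("How do I sort a list?", "sorted() is helpful here"))  # 0.25
```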

RL4LMs contains a registry of reward functions, which I find complex and which does not leverage the current pretrained model (by "current" I mean the SFT model we are working on, in this case PaLM). I think the reward function should be an integrated part of the language model itself, rather than outsourced to other models with different architectures that require separate pre-training and fine-tuning, and it should be able to attribute the reward to fine-grained sections of the output.
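As a hedged sketch of what "integrated" could look like (not this repo's actual `RewardModel` code): reuse the language model trunk itself and add only a small scalar head, so the reward shares the pretrained knowledge and can be read out per token. `base_lm`, `dim`, and the `return_hiddens` flag are assumptions standing in for whatever pretrained transformer is used:

```python
# Sketch: a reward model built directly on the LM trunk, with per-token scores
# so the reward can be attributed to fine-grained sections of the output.
import torch
import torch.nn as nn

class IntegratedRewardModel(nn.Module):
    def __init__(self, base_lm: nn.Module, dim: int):
        super().__init__()
        self.base_lm = base_lm              # shared (possibly frozen) pretrained trunk
        self.to_reward = nn.Linear(dim, 1)  # lightweight scalar head

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # assumption: base_lm can return per-token hidden states of shape (b, n, dim)
        hidden = self.base_lm(token_ids, return_hiddens=True)
        per_token_reward = self.to_reward(hidden).squeeze(-1)  # (b, n)
        # per-token rewards allow span-level attribution; pooling the last token
        # recovers the usual sequence-level scalar used for preference training
        return per_token_reward[:, -1]
```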

RLHF requires creating multiple models: SFT, RM, and the PPO-tuned policy. Is it possible to improve storage and memory efficiency, and reduce computation, by freezing most of the pretrained model's layers and fine-tuning only a few layers to create the SFT, RM, and PPO models, using OpenDelta or other libraries/methods? I read that your repo uses LoRA, but I'm not sure whether it fulfills all the goals described above. Common implementations like minRLHF require four separate models: three derived from the pretrained model as actor, critic, and reference, plus an external sentiment rating model.
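A minimal sketch of the parameter-efficient direction I have in mind (not the repo's LoRA implementation): freeze one shared trunk and train only small per-role heads, so SFT, RM, and the PPO critic add little storage and optimizer state. The sizes and the stand-in trunk below are placeholders:

```python
# Share one frozen pretrained trunk across SFT / RM / PPO roles; train only heads.
import torch.nn as nn

dim, vocab_size = 512, 32000                      # placeholder sizes
base_lm = nn.TransformerEncoder(                  # stand-in for a pretrained PaLM trunk
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
for p in base_lm.parameters():                    # freeze the large shared trunk
    p.requires_grad = False

# each role only adds a small trainable module on top of the frozen trunk
sft_head    = nn.Linear(dim, vocab_size)          # LM head for supervised fine-tuning
reward_head = nn.Linear(dim, 1)                   # scalar head for the reward model
value_head  = nn.Linear(dim, 1)                   # critic head for PPO

# only these parameters need optimizer state and separate checkpoints
trainable = [*sft_head.parameters(), *reward_head.parameters(), *value_head.parameters()]
print(sum(p.numel() for p in trainable), "trainable params vs",
      sum(p.numel() for p in base_lm.parameters()), "frozen params")
```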

To take this proposal even further, I think a good reward function can self-evolve and adapt to new environments (when the data source is no longer a fixed, static archive but a stream), making the model communicative, multipurpose, real-time, and perhaps even a step toward AGI. A good reward function lets the agent learn from almost anything, including human feedback, the computer system (sensor data, terminal/GUI input/output, the internet, program threads, and more), and self-invented signals. WebGPT is a clear example of turning GPT-3 into an active agent, and there will be more to come.