lucidrains / PaLM-rlhf-pytorch

Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM


Should critic's input be prompt only?

ginward opened this issue

In the PPO implementation, it seems the critic takes both the prompt and the generated actions as input (or, if `pooled` is true, the generated actions only). However, if we treat the prompt as the state S_t and the prompt plus the generated actions as S_{t+T}, shouldn't the value function be V(S_t) rather than V(S_{t+T})?
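For concreteness, here is a minimal sketch contrasting the two conventions; this is not the repo's actual API, and the names (`Critic`, `prompt_len`) are hypothetical:

```python
# Illustrative sketch only -- contrasts a critic that scores the full
# sequence, V(S_{t+T}), with one that scores the prompt alone, V(S_t).
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Toy value head: embed token ids, mean-pool, project to a scalar."""
    def __init__(self, num_tokens=256, dim=32):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, dim)
        self.to_value = nn.Linear(dim, 1)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)   # (batch, dim)
        return self.to_value(pooled).squeeze(-1)     # (batch,)

critic = Critic()
prompt_len = 4
sequence = torch.randint(0, 256, (2, 10))            # prompt + generated tokens

# convention in question: critic sees prompt AND generated actions -> V(S_{t+T})
v_full = critic(sequence)

# convention the issue proposes: critic sees the prompt only -> V(S_t)
v_prompt = critic(sequence[:, :prompt_len])
```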

In other words, when calculating the advantage, shouldn't the value function be the expected (average) reward over generations for a given prompt?
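Written out under a one-step (bandit-style) view of RLHF, where the prompt is the state and the full generation is the action, the baseline described here would be (my notation, not taken from the repo):

```latex
% state: prompt S_t ; action: full generation a ; R: scalar reward-model score
V(S_t) = \mathbb{E}_{a \sim \pi(\cdot \mid S_t)}\big[ R(S_t, a) \big]
\qquad
A(S_t, a) = R(S_t, a) - V(S_t)
```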