Should the critic's input be the prompt only?
ginward opened this issue
Jinhua Wang commented
In the PPO implementation, it seems that the critic model takes both the prompt and the generated actions as input (if pooled is true, then the generated actions only). However, if we treat the prompt as S_t and the prompt plus the generated actions as S_{t+T}, shouldn't the value function be V(S_t) rather than V(S_{t+T})?
In other words, when computing the advantage function, shouldn't the value function be the expected (average) reward for a prompt?
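
To make the question concrete, here is a minimal sketch in plain Python. It is not the repository's code: the function names `prompt_baseline_advantage` and `gae`, the toy rewards, and the toy critic values are all hypothetical. It assumes a critic that scores every prefix (the prompt, then the prompt plus each generated token) and compares the resulting per-token advantages against the single advantage R - V(prompt) that a prompt-only critic would give.

```python
def prompt_baseline_advantage(total_reward, prompt_value):
    """Advantage when the critic scores the prompt only: A = R - V(prompt)."""
    return total_reward - prompt_value


def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized advantage estimation over the generated tokens.

    rewards[t] is the reward received after generating token t.
    values[t] is the (hypothetical) critic output for the prompt plus the
    first t generated tokens, so values[0] is V(prompt). values has one more
    entry than rewards; the last entry is the value after the final token
    (0.0 when the episode terminates there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at position t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy episode: three generated tokens, reward only at the end of the sequence.
rewards = [0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.0]  # made-up critic outputs, values[0] = V(prompt)

per_token = gae(rewards, values)
prompt_only = prompt_baseline_advantage(sum(rewards), values[0])

print(per_token)
print(prompt_only)
```

Under these assumptions (gamma = lam = 1 and a terminal value of 0), the advantage of the first generated token equals R - V(prompt), so the prompt-only baseline appears as the first entry of the per-token estimate. The later entries use V of the longer prefixes, which is one way to read a critic that takes the prompt and the generated actions as input.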