kvablack / ddpo-pytorch

DDPO for finetuning diffusion models, implemented in PyTorch with LoRA support

prompt-dependent value function optimization

hkunzhe opened this issue · comments

commented

I saw you mentioned a prompt-dependent value function in #7 (comment). By chance, I happen to be using DDPO for a related optimization. Consider the ideal case where there is only one prompt and its corresponding reward function. Even then, I found that in the early stages of training the reward mean fluctuates a lot, even if I increase the training batch size or reduce the learning rate, although the reward mean does rise overall in the end. Are there any optimization techniques to make training on a single prompt more stable? Any suggestions or insights would be greatly appreciated.
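
For concreteness, here is a minimal sketch of what I mean by a per-prompt reward baseline (the class, method names, and defaults below are my own illustration, not code from this repo); with a single prompt it reduces to a running baseline over recent rewards:

```python
import numpy as np
from collections import deque

class PerPromptRewardTracker:
    """Keeps a buffer of recent rewards per prompt and normalizes new
    rewards against that buffer, so the advantage signal stays centered
    even when raw rewards drift or fluctuate early in training."""

    def __init__(self, buffer_size=64, min_count=8):
        self.buffer_size = buffer_size
        self.min_count = min_count
        self.stats = {}  # prompt -> deque of recent rewards

    def compute_advantages(self, prompts, rewards):
        rewards = np.asarray(rewards, dtype=np.float64)
        advantages = np.empty_like(rewards)
        for i, prompt in enumerate(prompts):
            buf = self.stats.setdefault(prompt, deque(maxlen=self.buffer_size))
            buf.append(rewards[i])
            if len(buf) < self.min_count:
                # too few samples for this prompt: fall back to batch-wide statistics
                advantages[i] = (rewards[i] - rewards.mean()) / (rewards.std() + 1e-6)
            else:
                # normalize against the running per-prompt mean and std
                advantages[i] = (rewards[i] - np.mean(buf)) / (np.std(buf) + 1e-6)
        return advantages
```

Is something along these lines what you had in mind, or would a learned value function conditioned on the prompt be needed?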