kvablack / ddpo-pytorch

DDPO for finetuning diffusion models, implemented in PyTorch with LoRA support

prompt-dependent value function optimization

hkunzhe opened this issue · comments

commented

I saw you mentioned a prompt-dependent value function in #7 (comment). By chance, I happen to be using DDPO for a related optimization. Consider the ideal case where there is only one prompt and its corresponding reward function. Even then, I found that in the early stages of training the reward mean fluctuates a lot, even if I increase the training batch size or reduce the learning rate, although the reward mean does rise overall in the end. Are there any optimization techniques to make training on a single prompt more stable? Any suggestions or insights would be greatly appreciated.
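
For concreteness, here is a minimal sketch of what I mean by a per-prompt reward baseline (the class, method names, and defaults below are my own illustration, not code from this repo); with a single prompt it reduces to a running baseline over recent rewards:

```python
import numpy as np
from collections import deque

class PerPromptRewardTracker:
    """Keeps a buffer of recent rewards per prompt and normalizes new
    rewards against that buffer, so the advantage signal stays centered
    even when raw rewards drift or fluctuate early in training."""

    def __init__(self, buffer_size=64, min_count=8):
        self.buffer_size = buffer_size
        self.min_count = min_count
        self.stats = {}  # prompt -> deque of recent rewards

    def compute_advantages(self, prompts, rewards):
        rewards = np.asarray(rewards, dtype=np.float64)
        advantages = np.empty_like(rewards)
        for i, prompt in enumerate(prompts):
            buf = self.stats.setdefault(prompt, deque(maxlen=self.buffer_size))
            buf.append(rewards[i])
            if len(buf) < self.min_count:
                # too few samples for this prompt: fall back to batch-wide statistics
                advantages[i] = (rewards[i] - rewards.mean()) / (rewards.std() + 1e-6)
            else:
                # normalize against the running per-prompt mean and std
                advantages[i] = (rewards[i] - np.mean(buf)) / (np.std(buf) + 1e-6)
        return advantages
```

Is something along these lines what you had in mind, or would a learned value function conditioned on the prompt be needed?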