Did you increase the decoupled weight decay simultaneously when decreasing the learning rate?
xiangning-chen opened this issue · comments
Thanks for implementing and testing our lion optimizer!
Just wondering, did you also enlarge the decoupled weight decay to maintain the regularization strength?
best,
--xiangning
@xiangning-chen Hi Xiangning! Thank you for this interesting paper.
So far I have only been testing with weight decay turned off. A lot of networks are still trained with just plain Adam, and I wanted to see how Lion fares against Adam alone.
@xiangning-chen But yes, I have noted the section in the paper where you say the weight decay needs to be higher. Let me add that to the README to increase the chances that people train with it correctly.
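For anyone following along, the reasoning behind the paper's advice can be sketched numerically. With decoupled weight decay, each step shrinks the weights by a factor proportional to `lr * wd`, so the effective regularization strength is that product. If Lion's learning rate is some factor smaller than AdamW's, enlarging the weight decay by the same factor keeps the product unchanged. The concrete values below are hypothetical, not from this thread:

```python
# Hypothetical AdamW hyperparameters (illustrative only).
adamw_lr, adamw_wd = 3e-4, 0.1

# Lion typically runs at a smaller learning rate; assume 3x smaller here.
shrink = 3
lion_lr = adamw_lr / shrink
# Enlarge weight decay by the same factor to preserve lr * wd.
lion_wd = adamw_wd * shrink

# Effective per-step decoupled weight decay strength: lr * wd.
print(f"AdamW effective decay: {adamw_lr * adamw_wd:.2e}")
print(f"Lion  effective decay: {lion_lr * lion_wd:.2e}")
```

Both products come out equal, which is the sense in which raising the weight decay "maintains the regularization strength" when the learning rate is lowered.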
Thanks for the update!
Yeah, disabling weight decay for both optimizers is a meaningful and fair comparison, thank you!
@xiangning-chen OK, good luck! Hope this technique holds up to scrutiny!