Did you increase the decoupled weight decay simultaneously when decreasing the learning rate?
xiangning-chen opened this issue · comments
Thanks for implementing and testing our lion optimizer!
Just wondering, did you also enlarge the decoupled weight decay to maintain the regularization strength?
best,
--xiangning
@xiangning-chen Hi Xiangning! Thank you for this interesting paper.
So far I have only been testing with weight decay turned off. A lot of networks are still trained with just plain Adam, and I wanted to see how Lion fares against Adam alone.
@xiangning-chen But yes, I have noted the section in the paper where you say the weight decay needs to be higher. Let me add that to the README to increase the chances that people train with it correctly.
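For anyone following along, the reasoning behind the paper's advice can be sketched numerically. With decoupled weight decay, each step shrinks the weights by a factor proportional to `lr * wd`, so the effective regularization strength is that product. If Lion's learning rate is some factor smaller than AdamW's, enlarging the weight decay by the same factor keeps the product unchanged. The concrete values below are hypothetical, not from this thread:

```python
# Hypothetical AdamW hyperparameters (illustrative only).
adamw_lr, adamw_wd = 3e-4, 0.1

# Lion typically runs at a smaller learning rate; assume 3x smaller here.
shrink = 3
lion_lr = adamw_lr / shrink
# Enlarge weight decay by the same factor to preserve lr * wd.
lion_wd = adamw_wd * shrink

# Effective per-step decoupled weight decay strength: lr * wd.
print(f"AdamW effective decay: {adamw_lr * adamw_wd:.2e}")
print(f"Lion  effective decay: {lion_lr * lion_wd:.2e}")
```

Both products come out equal, which is the sense in which raising the weight decay "maintains the regularization strength" when the learning rate is lowered.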
Thanks for the update!
Yeah, disabling weight decay for both optimizers is a meaningful and fair comparison, thank you!
@xiangning-chen OK, good luck! Hope this technique holds up to scrutiny!