lucidrains / lion-pytorch

🦁 Lion, a new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(W), in PyTorch

Performance experiments over AdamW

conceptofmind opened this issue

Hi Phil,

I have been testing different Lion hyperparameters with PaLM at the 1B scale (total batch size of 192, ~1.6 million tokens per batch), using a decoupled weight decay of 0.1 for all runs. So far the best configuration was:

  • learning rate of 3e-4
  • betas of (0.90, 0.98)

This configuration gave roughly a 0.2 loss improvement over AdamW. Memory consumption was ~4% lower, and iteration time dropped from 1.65 s to 1.51 s, a speedup of about 0.14 s per iteration.
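For anyone who wants to try this, here is a minimal sketch of how that configuration maps onto the `Lion` constructor from this repo (the tiny `nn.Linear` model and dummy loss are just placeholders; the actual runs used a 1B-parameter PaLM):

```python
import torch
from lion_pytorch import Lion

# placeholder model; the runs above used a 1B-parameter PaLM
model = torch.nn.Linear(512, 512)

# Lion with the best configuration found so far:
# learning rate 3e-4, betas (0.90, 0.98), decoupled weight decay 0.1
optimizer = Lion(
    model.parameters(),
    lr=3e-4,
    betas=(0.90, 0.98),
    weight_decay=0.1,
)

# standard training step with a dummy loss
loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```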

Wandb logs:
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-43-11---Vmlldzo0MzE0MTcy
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-48-41---Vmlldzo0MzE0MjAz

I am going to test at the 2B scale next and will report the results. I am also going to try adjusting the learning rate and betas further. Have you noticed a significant difference in performance as you scaled up the model?

Thank you,

Enrico

I did not realize there was a discussion thread.