Performance experiments over AdamW
conceptofmind opened this issue
Hi Phil,
I have been testing some different Lion hyperparameters with PaLM at the 1B scale (Total batch size 192. ~1.6 million tokens a batch). Using a decoupled weight decay of 0.1 for all runs. So far the best configuration was:
- learning rate: 3e-4
- betas: (0.90, 0.98)
This gave roughly a 0.2 loss improvement over AdamW. Memory consumption was ~4% lower, and iteration time dropped from 1.65 to 1.51, about 0.14 faster per step.
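For reference, a minimal sketch of how this configuration could be instantiated with the `Lion` class from the lion-pytorch package (the `model` here is a stand-in for the 1B-parameter PaLM; the training loop is omitted):

```python
from lion_pytorch import Lion

# Sketch only: `model` is assumed to be an already-constructed PaLM instance.
optimizer = Lion(
    model.parameters(),
    lr=3e-4,             # best learning rate found so far
    betas=(0.90, 0.98),  # best betas found so far
    weight_decay=0.1,    # decoupled weight decay used for all runs
)
```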
Wandb logs:
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-43-11---Vmlldzo0MzE0MTcy
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-48-41---Vmlldzo0MzE0MjAz
I am going to test at the 2B scale next and will report the results. I also plan to adjust the learning rate and betas further. Have you noticed a significant difference in Lion's relative performance as model size increases?
Thank you,
Enrico
I did not realize there was a discussion thread.