Question about learning rate and optimizer
Osama26-byte opened this issue · comments
Hi, thanks for this great work. I was training the PTv3 model on the Waymo dataset on a single RTX 3090 (24 GB) with a batch size of 2. I am training with a learning rate of 0.0004, but this does not seem to work: once the learning rate ramps up to 0.0004 with OneCycleLR after 5 epochs, the loss gradually increases.
Can you suggest how to set the learning rate, optimizer, and OneCycleLR parameters for a batch size of 2 on a single GPU?
These are my loss and learning-rate curves for training on a single GPU with a batch size of 2 on the Waymo dataset. After the 2nd epoch the loss went NaN.
My optimizer and lr_scheduler configs are:

```python
optimizer = dict(type="AdamW", lr=0.0008, weight_decay=0.005)
lr_config = dict(
    type="OneCycleLR",
    max_lr=0.0008,
    pct_start=0.04,
    anneal_strategy="cos",
    div_factor=10.0,
    final_div_factor=100.0,
)
```

I am also using gradient clipping:

```python
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```
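For reference, a minimal sketch of what this config corresponds to in plain PyTorch; the `Linear` model and `total_steps` value are placeholders standing in for PTv3 and the real schedule length:

```python
import torch

# Stand-in model; PTv3 would go here.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0008, weight_decay=0.005)

total_steps = 1000  # assumption: epochs * iterations per epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.0008,
    total_steps=total_steps,
    pct_start=0.04,           # warmup covers the first 4% of training
    anneal_strategy="cos",
    div_factor=10.0,          # initial lr = max_lr / 10 = 8e-5
    final_div_factor=100.0,   # final lr = initial lr / 100 = 8e-7
)

print(optimizer.param_groups[0]["lr"])  # starts at max_lr / div_factor = 8e-05
```

Note that with `div_factor=10.0` the schedule starts at 8e-5, ramps up to the `max_lr` of 8e-4 over the first 4% of steps, then cosine-anneals down, so the instability appearing as the learning rate peaks is consistent with the curves described above.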
Hi, as with any MSA-based (multi-head self-attention) model, training is sensitive to hyperparameters. Also, we don't recommend training the model with a small batch size.
Thanks for the reply. Another question: I looked into the optimizer code, and you split the model parameters into two groups, one for parameters with the 'block' keyword in their name and one for all remaining parameters. Can you explain how this works (if I have understood it correctly), and will it affect performance if I train the model without this parameter division?
So, one technique for training attention models is to scale the learning rate of the attention blocks to 1/10 of the default learning rate.
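A sketch of that technique using AdamW parameter groups; the two-module model below is a stand-in for PTv3, and the name-matching on "block" mirrors the grouping described above:

```python
import torch

base_lr = 0.0008
# Stand-in model: "block" plays the role of an attention block in PTv3.
model = torch.nn.ModuleDict({
    "embed": torch.nn.Linear(8, 8),
    "block": torch.nn.Linear(8, 8),
})

# Split parameters by name: anything containing "block" gets 1/10 of base_lr.
block_params = [p for n, p in model.named_parameters() if "block" in n]
other_params = [p for n, p in model.named_parameters() if "block" not in n]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": base_lr},
        {"params": block_params, "lr": base_lr * 0.1},  # attention blocks at 1/10 lr
    ],
    weight_decay=0.005,
)

for group in optimizer.param_groups:
    print(group["lr"])  # 0.0008 then 8e-05
```

Training without this division simply puts every parameter on the full base learning rate, which makes the attention blocks more prone to the kind of divergence described above.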