Question about learning rate and optimizer
Osama26-byte opened this issue · comments
Hi, thanks for this great work. I was training the PTv3 model on the Waymo dataset on a single RTX 3090 (24 GB) with a batch size of 2. I am training with a learning rate of 0.0004, but this does not seem to work: once the learning rate ramps up to 0.0004 with OneCycleLR after 5 epochs, the loss gradually increases.
Can you suggest how to set the learning rate, optimizer, and OneCycleLR parameters for a batch size of 2 on a single GPU?
These are my loss and learning-rate curves for training on a single GPU with a batch size of 2 on the Waymo dataset. After the 2nd epoch the loss went NaN.
My optimizer and lr_scheduler configs are:

```python
optimizer = dict(type="AdamW", lr=0.0008, weight_decay=0.005)
lr_config = dict(
    type="OneCycleLR",
    max_lr=0.0008,
    pct_start=0.04,
    anneal_strategy="cos",
    div_factor=10.0,
    final_div_factor=100.0,
)
```

I am also using gradient clipping:

```python
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
```
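For reference, a minimal sketch of what this config corresponds to in plain PyTorch; the `Linear` model and `total_steps` value are placeholders standing in for PTv3 and the real schedule length:

```python
import torch

# Stand-in model; PTv3 would go here.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0008, weight_decay=0.005)

total_steps = 1000  # assumption: epochs * iterations per epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.0008,
    total_steps=total_steps,
    pct_start=0.04,           # warmup covers the first 4% of training
    anneal_strategy="cos",
    div_factor=10.0,          # initial lr = max_lr / 10 = 8e-5
    final_div_factor=100.0,   # final lr = initial lr / 100 = 8e-7
)

print(optimizer.param_groups[0]["lr"])  # starts at max_lr / div_factor = 8e-05
```

Note that with `div_factor=10.0` the schedule starts at 8e-5, ramps up to the `max_lr` of 8e-4 over the first 4% of steps, then cosine-anneals down, so the instability appearing as the learning rate peaks is consistent with the curves described above.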
Hi, as with any MSA-based (multi-head self-attention) model, training is sensitive to hyperparameters. Also, we don't recommend training the model with a small batch size.
Thanks for the reply. Another question: I looked into the optimizer code, and you split the model parameters into two groups, one for parameters with the 'block' keyword in their name and one for all remaining parameters. Can you explain how this works (if I have understood it correctly), and will it affect performance if I train the model without this parameter division?
So, one technique for training attention models is to scale the learning rate of the attention blocks to 1/10 of the default learning rate.
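A sketch of that technique using AdamW parameter groups; the two-module model below is a stand-in for PTv3, and the name-matching on "block" mirrors the grouping described above:

```python
import torch

base_lr = 0.0008
# Stand-in model: "block" plays the role of an attention block in PTv3.
model = torch.nn.ModuleDict({
    "embed": torch.nn.Linear(8, 8),
    "block": torch.nn.Linear(8, 8),
})

# Split parameters by name: anything containing "block" gets 1/10 of base_lr.
block_params = [p for n, p in model.named_parameters() if "block" in n]
other_params = [p for n, p in model.named_parameters() if "block" not in n]

optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": base_lr},
        {"params": block_params, "lr": base_lr * 0.1},  # attention blocks at 1/10 lr
    ],
    weight_decay=0.005,
)

for group in optimizer.param_groups:
    print(group["lr"])  # 0.0008 then 8e-05
```

Training without this division simply puts every parameter on the full base learning rate, which makes the attention blocks more prone to the kind of divergence described above.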