Pointcept / Pointcept

Pointcept: a codebase for point cloud perception research. Latest works: PTv3 (CVPR'24 Oral), PPT (CVPR'24), OA-CNNs (CVPR'24), MSC (CVPR'23)

Question about learning rate and optimizer

Osama26-byte opened this issue · comments

Hi, thanks for this great work. I was training the PTv3 model on the Waymo dataset on a single RTX 3090 (24 GB) with a batch size of 2. I am training with a learning rate of 0.0004, but this does not seem to work: when the learning rate reaches 0.0004 with OneCycleLR after 5 epochs, the loss gradually increases.
Can you suggest how to set the learning rate, optimizer, and OneCycleLR parameters for a batch size of 2 on a single GPU?

These are my loss and learning-rate curves for training on a single GPU with a batch size of 2 on the Waymo dataset. After the 2nd epoch the loss went NaN.

[Screenshots: training loss curve and learning-rate curve]

My optimizer and lr_scheduler configs are:

optimizer = dict(type="AdamW", lr=0.0008, weight_decay=0.005)

lr_config = dict(
    type="OneCycleLR",
    max_lr=0.0008,
    pct_start=0.04,
    anneal_strategy="cos",
    div_factor=10.0,
    final_div_factor=100.0,
)

I am also using grad_clip:
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
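Would something like the following linear-scaling adjustment be reasonable? This is only a minimal sketch: the reference total batch size of 12 and reference max_lr of 0.002 are assumptions for illustration, not values taken from the released config.

# Linear-scaling heuristic (illustrative; reference values are assumptions)
reference_batch_size = 12       # assumed total batch size of the reference config
reference_max_lr = 0.002        # assumed peak learning rate of the reference config
my_batch_size = 2

scaled_max_lr = reference_max_lr * my_batch_size / reference_batch_size  # ~0.00033

optimizer = dict(type="AdamW", lr=scaled_max_lr, weight_decay=0.005)
lr_config = dict(
    type="OneCycleLR",
    max_lr=scaled_max_lr,
    pct_start=0.04,
    anneal_strategy="cos",
    div_factor=10.0,
    final_div_factor=100.0,
)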

Hi, as with any MSA-based attention model, training is sensitive to hyperparameters. Also, we don't recommend training the model with a small batch size.

Thanks for the reply. Another question: I looked into the optimizer code, and you are splitting the model parameters into two groups, one for parameters whose names contain the 'block' keyword and one for all remaining parameters. Can you explain how this works (if I am right that this is what you are doing), and will it affect performance if I train the model without this parameter division?

So, one technique for training attention is scaling the learning rate of the attention blocks to 1/10 of the default learning rate.
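Something like the following sketch in plain PyTorch illustrates the idea (this is not the exact Pointcept helper; the "block" name match and the 0.1 factor just follow the description above):

import torch

def build_param_groups(model, base_lr=0.0008, block_lr_scale=0.1):
    # Split parameters into two groups: names containing "block"
    # (the attention blocks) get a reduced learning rate, everything
    # else keeps the base learning rate.
    block_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "block" in name:
            block_params.append(param)
        else:
            other_params.append(param)
    return [
        dict(params=other_params, lr=base_lr),
        dict(params=block_params, lr=base_lr * block_lr_scale),
    ]

# Usage sketch:
# optimizer = torch.optim.AdamW(build_param_groups(model), lr=0.0008, weight_decay=0.005)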