facebookresearch / mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

param_groups_lrd for layer decay

1119736939 opened this issue · comments

From line 25 of lr_decay.py: `layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))`.
The elements of `layer_scales` are increasing, so the learning rates follow "the deeper the layer, the greater the learning rate". I printed the learning rates after executing `lr_sched.adjust_learning_rate`, and they confirm this. But shouldn't deeper layers get a *smaller* learning rate? I'm confused. Please answer my question. Thanks.
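
For reference, here is the computation in isolation, with `layer_decay = 0.75` and `num_layers = 12` as assumed example values (the issue does not state the actual configuration):

```python
# Reproduce the layer_scales line from lr_decay.py with example values.
# layer_decay = 0.75 and num_layers = 12 are assumptions for illustration.
layer_decay = 0.75
num_layers = 12

layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))

for i, scale in enumerate(layer_scales):
    print(f"layer {i:2d}: scale = {scale:.4f}")
# The scales increase monotonically: layer 0 gets 0.75**12 ≈ 0.0317
# and layer 12 gets 0.75**0 = 1.0.
```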

The layers are indexed so that the first block (the one closest to the raw input) has index 0, and the last block (the one closest to predicting the logits) has index L - 1, so the later layers do correctly get the larger learning rate. This is the intended behavior of layer-wise learning-rate decay: blocks near the input hold the most generic pretrained features and are fine-tuned more gently, while blocks near the head are updated with closer to the full base learning rate.
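
To make the indexing concrete, here is a minimal sketch of how the scales translate into per-layer learning rates (`base_lr = 1e-3` is an illustrative value, not one taken from the repo):

```python
# Sketch: effective learning rate per layer under layer-wise lr decay.
# base_lr, layer_decay, and num_layers are illustrative assumptions.
base_lr = 1e-3
layer_decay = 0.75
num_layers = 12

def effective_lr(layer_id: int) -> float:
    # Same scale formula as layer_scales above: the scale grows with
    # layer_id, reaching 1.0 at the layer closest to the head.
    return base_lr * layer_decay ** (num_layers - layer_id)

print(f"layer  0 (closest to input): {effective_lr(0):.2e}")   # smallest lr
print(f"layer  6 (middle):           {effective_lr(6):.2e}")
print(f"layer 12 (closest to head):  {effective_lr(12):.2e}")  # == base_lr
```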