facebookresearch / mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

param_groups_lrd for layer decay

1119736939 opened this issue · comments

From line 25 of lr_decay.py: `layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))`.
The elements of `layer_scales` are increasing, so the learning rates follow "the deeper the layer, the greater the learning rate". I printed the learning rates after executing `lr_sched.adjust_learning_rate`, and they confirm this. But shouldn't deeper layers get a *smaller* learning rate? I'm confused. Please answer my question. Thanks.
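
For reference, here is the computation in isolation, with `layer_decay = 0.75` and `num_layers = 12` as assumed example values (the issue does not state the actual configuration):

```python
# Reproduce the layer_scales line from lr_decay.py with example values.
# layer_decay = 0.75 and num_layers = 12 are assumptions for illustration.
layer_decay = 0.75
num_layers = 12

layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))

for i, scale in enumerate(layer_scales):
    print(f"layer {i:2d}: scale = {scale:.4f}")
# The scales increase monotonically: layer 0 gets 0.75**12 ≈ 0.0317
# and layer 12 gets 0.75**0 = 1.0.
```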

The layers are indexed so that the first block (the one closest to the raw input) has index 0, and the last block (the one closest to predicting the logits) has index L - 1, so the later layers do correctly get the larger learning rate. This is the intended behavior of layer-wise learning-rate decay: blocks near the input hold the most generic pretrained features and are fine-tuned more gently, while blocks near the head are updated with closer to the full base learning rate.
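
To make the indexing concrete, here is a minimal sketch of how the scales translate into per-layer learning rates (`base_lr = 1e-3` is an illustrative value, not one taken from the repo):

```python
# Sketch: effective learning rate per layer under layer-wise lr decay.
# base_lr, layer_decay, and num_layers are illustrative assumptions.
base_lr = 1e-3
layer_decay = 0.75
num_layers = 12

def effective_lr(layer_id: int) -> float:
    # Same scale formula as layer_scales above: the scale grows with
    # layer_id, reaching 1.0 at the layer closest to the head.
    return base_lr * layer_decay ** (num_layers - layer_id)

print(f"layer  0 (closest to input): {effective_lr(0):.2e}")   # smallest lr
print(f"layer  6 (middle):           {effective_lr(6):.2e}")
print(f"layer 12 (closest to head):  {effective_lr(12):.2e}")  # == base_lr
```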