The implementation of layerwise learning rate decay
importpandas opened this issue
Lines 188 to 193 in commit 7911132
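The embedded code block doesn't render here. As a rough sketch, this is what the referenced logic appears to compute, reconstructed from the values quoted in this thread rather than copied from the commit (key names other than `encoder/layer_*/` are assumptions; see the linked commit for the actual code):

```python
import collections

def _get_layer_lrs(learning_rate, layer_decay, n_layers):
    """Sketch: lower learning rates for layers closer to the input."""
    key_to_depths = collections.OrderedDict({
        "/embeddings/": 0,               # assumed key name for the embedding layer
        "task_specific/": n_layers + 2,  # task head sits above everything
    })
    for layer in range(n_layers):
        # With n_layers=24 this gives key_to_depths["encoder/layer_23/"] = 24
        key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
    return {
        key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
        for key, depth in key_to_depths.items()
    }
```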
According to the code referenced above, assume that `n_layers=24`; then `key_to_depths["encoder/layer_23/"] = 24`, which is the depth of the last encoder layer. But the learning rate for that layer is:

`learning_rate * (layer_decay ** (24 + 2 - 24)) = learning_rate * (layer_decay ** 2)`

That's what confuses me. Why is the learning rate for the last layer `learning_rate * (layer_decay ** 2)` rather than `learning_rate`? Am I missing something?
For the layerwise learning rate decay we count the task-specific layer added on top of the pre-trained transformer as an additional layer of the model, so the learning rate for the last layer of ELECTRA should be `learning_rate * 0.8`. But you've still found a bug: instead it is `learning_rate * 0.8^2`.
The bug happened because there used to be a pooler layer in ELECTRA before we removed the next-sentence-prediction task. In that case the learning rates per layer were:
- task-specific softmax: learning_rate
- pooler: learning_rate * 0.8
- transformer layer 24: learning_rate * 0.8^2
- transformer layer 23: learning_rate * 0.8^3
- ...
However, when we removed the pooler layer, we didn't fix the learning rates correspondingly. I guess in practice this didn't hurt performance much, so I'm leaving it as-is for now to keep results reproducible.
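For reference, a minimal sketch of what the corrected exponents would look like (hypothetical; this is not the code in the repo, and key names besides `encoder/layer_*/` are assumptions as above). With the pooler gone, the task-specific head sits at depth `n_layers + 1`, so the top transformer layer gets `learning_rate * layer_decay`:

```python
import collections

def _get_layer_lrs_fixed(learning_rate, layer_decay, n_layers):
    """Hypothetical pooler-free variant (NOT the upstream code): the
    task-specific head now sits directly above the top transformer layer."""
    key_to_depths = collections.OrderedDict({
        "/embeddings/": 0,
        "task_specific/": n_layers + 1,  # was n_layers + 2 when the pooler existed
    })
    for layer in range(n_layers):
        key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
    return {
        key: learning_rate * (layer_decay ** (n_layers + 1 - depth))
        for key, depth in key_to_depths.items()
    }

# With n_layers=24 and layer_decay=0.8, the top transformer layer now gets
# learning_rate * 0.8 rather than learning_rate * 0.8**2:
lrs = _get_layer_lrs_fixed(learning_rate=1e-4, layer_decay=0.8, n_layers=24)
assert abs(lrs["encoder/layer_23/"] - 1e-4 * 0.8) < 1e-15
assert abs(lrs["task_specific/"] - 1e-4) < 1e-15
```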
I got it, thanks for your explanation.