bug in cosine learning rate decay?
t-taniai opened this issue · comments
I recalled another issue, which may or may not be a bug.
`TrainerConfig` has a parameter `final_tokens` for cosine lr decay, which is set to `final_tokens=2*len(train_dataset)*blockSize`. To my understanding, this draws an lr curve with many repeated cycles of a cosine function over the course of training (i.e., decaying from 1 to 0 and then increasing from 0 back to 1, epoch after epoch). I'm not familiar with cosine lr decay, but shouldn't the proper setting be `final_tokens=numEpochs*len(train_dataset)*blockSize` (i.e., a single decay from 1 to 0, a half cycle of cos(t)+1, over the whole training run)?
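The behavior described above can be sketched with a toy cosine-decay multiplier. This is a hypothetical simplification, not the repository's exact code (names like `cosine_lr_mult`, the `floor` value, and the token counts are illustrative assumptions): once the token count passes `final_tokens`, the cosine term wraps around and the multiplier climbs back toward 1 instead of staying decayed.

```python
import math

def cosine_lr_mult(tokens, final_tokens, floor=0.1):
    # Hypothetical sketch of a cosine lr-decay multiplier. If progress
    # exceeds 1.0 (tokens > final_tokens), cos() keeps oscillating and
    # the schedule cycles instead of staying at its minimum.
    progress = tokens / final_tokens
    return max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Suppose one epoch processes N tokens and we train for 4 epochs.
N = 1000
# With final_tokens = 2*N, the multiplier fully decays after 2 epochs,
# then rises back to 1.0 by epoch 4 -- the cycling described above:
print(cosine_lr_mult(0, 2 * N))      # 1.0 (start of training)
print(cosine_lr_mult(2 * N, 2 * N))  # 0.1 (fully decayed, epoch 2)
print(cosine_lr_mult(4 * N, 2 * N))  # 1.0 (cycled back up, epoch 4)
# With final_tokens = 4*N (numEpochs * tokens per epoch), the schedule
# decays exactly once over the whole run:
print(cosine_lr_mult(4 * N, 4 * N))  # 0.1 (decayed at the end)
```

Setting `final_tokens` to the total number of tokens seen over all epochs makes `progress` reach 1.0 exactly at the end of training, giving the single half-cycle decay.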
Thank you for bringing this up. The hyperparameters in this repository are certainly not optimal. The best way to address this would probably be to run a tuner and select the hyperparameters that work best. However, I agree that `final_tokens=numEpochs*len(train_dataset)*blockSize` is most likely the better choice.