Xirider / finetune-gpt2xl

Guide: Finetune GPT2-XL (1.5 billion parameters) and GPT-Neo (2.7B) on a single GPU with Huggingface Transformers using DeepSpeed

Ideal number of epochs? Number of examples meaning?

mallorbc opened this issue · comments

Is there a recommended number of epochs to use? I was able to successfully train on a custom dataset with nearly 45k entries in the training set and nearly 11k in the validation set. In the example, the flag is set to only 1 epoch. However, I have found that training for 4 epochs leads to a lower loss than 1 epoch, and I imagine continuing to train the model would lead to an even better result. It is difficult to say at what point overfitting may start occurring, as the validation data is only evaluated at the end of training.

Thus I ask: is there a rough ideal number of epochs for fine-tuning? If there is, I think it would be a good idea to add it to the README (which I can do if needed).

My second question is about the "Num examples" part of training and evaluation. As I said, I have nearly 45k training texts and nearly 11k validation texts. However, "Num examples" says 1472 and 365 respectively for training and validation. What does this mean? Is not all the data being used? Why does it not show the much larger numbers of 45k and 11k?

Thanks for the repo and for your help. This is very cool and relatively easy to work with once one gets some experience with DeepSpeed.

Hi,
You can change how often the loss is evaluated against your eval set with --eval_steps; in the example it is evaluated every 15 steps. As long as the eval loss goes down, the model will usually improve. Sadly, there are no good rules for how many epochs you need. For me, everything between 1 and 15 epochs worked well, depending on how much data I have and how much I want to overfit. Just set --save_steps to 500 and test each checkpoint yourself.
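
For reference, here is roughly what that could look like on a command like the one in the README (a sketch: the model name, data files, DeepSpeed config and output directory below are placeholders; the flags being discussed are --eval_steps, --save_steps and --num_train_epochs):

```bash
deepspeed --num_gpus=1 run_clm.py \
  --deepspeed ds_config.json \
  --model_name_or_path gpt2-xl \
  --train_file train.csv \
  --validation_file validation.csv \
  --do_train --do_eval --fp16 \
  --evaluation_strategy steps \
  --eval_steps 15 \
  --save_steps 500 \
  --num_train_epochs 4 \
  --output_dir finetuned
```

Each saved checkpoint ends up in a `checkpoint-<step>` subfolder of --output_dir, so you can load and compare them individually.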

The reason why the number of examples is lower than your counts of training and eval texts is that run_clm.py concatenates all your texts with EOS (end-of-sequence) tokens in between. The long token stream is then split into equal parts of your defined block_size (check line 374 of run_clm.py). You are not losing any data.
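
To illustrate, here is a simplified sketch of that grouping step (not the exact code from run_clm.py; the block_size value, the tokenizer setup and the `texts` list are assumptions for the example):

```python
from transformers import GPT2TokenizerFast

# Hypothetical inputs: a GPT-2 tokenizer and your raw training texts.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
texts = ["first training text", "second training text"]  # placeholder data

block_size = 1024  # assumption: run_clm.py caps block_size at 1024 by default

# 1. Tokenize each text and concatenate everything into one long token stream,
#    with an EOS token separating the individual texts.
token_stream = []
for text in texts:
    token_stream.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])

# 2. Drop the small remainder and split the stream into blocks of block_size tokens.
total_length = (len(token_stream) // block_size) * block_size
examples = [token_stream[i : i + block_size]
            for i in range(0, total_length, block_size)]

# "Num examples" reported during training is len(examples), not len(texts);
# with real data it is roughly total_tokens // block_size.
print(len(examples))
```

Under that assumption (block_size of 1024), your 1472 training examples would correspond to roughly 1.5M tokens, i.e. about 33 tokens per text on average, which is why the reported count is so much smaller than 45k.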