Loss curve differences for pretraining
maxidl opened this issue
Hi,
we are currently pretraining a 7b model on ~1T tokens, and our loss curve looks similar to the one in the MPT-7b blog post, figure 6.
I am wondering why in that curve the loss always stays above 2.0, while for other training runs (e.g. LLaMa-1, figure 1; LLaMa-2, figure 5) it quickly drops below 2.0.
I get that in the end the MPT model performs well, but can this difference in loss values be explained?
Potential reasons I can think of:
FSDP vs. other training frameworks (e.g. DeepSpeed; unknown for LLaMa)?
Any pointers appreciated.
The loss depends on the exact dataset and tokenizer used, so comparing loss values across training runs that use different datasets and different tokenizers is not meaningful.
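To make this concrete: per-token cross-entropy is only defined relative to a tokenizer, so a tokenizer that compresses text more aggressively (fewer tokens per byte) will show a higher per-token loss even when the model assigns the same total probability to the text. A tokenizer-independent comparison normalizes to bits per byte of raw text. A minimal sketch, using made-up numbers purely for illustration:

```python
import math

def bits_per_byte(loss_nats_per_token: float, tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text.

    total nats = loss * tokens; divide by bytes and by ln(2) to get bits/byte.
    """
    return loss_nats_per_token * (tokens / num_bytes) / math.log(2)

# Hypothetical runs over the same 4000-byte corpus:
# run A's tokenizer compresses more (1000 tokens), run B's less (1200 tokens).
bpb_a = bits_per_byte(2.1, tokens=1_000, num_bytes=4_000)
bpb_b = bits_per_byte(1.9, tokens=1_200, num_bytes=4_000)

# Despite the *higher* per-token loss, run A is actually the better
# compressor of the underlying text (lower bits per byte).
print(f"{bpb_a:.3f} vs {bpb_b:.3f}")
```

Under these assumed numbers, the run with loss 2.1 comes out ahead in bits per byte, which is exactly why raw loss values from runs with different tokenizers (and different datasets) cannot be compared directly.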