Loss curve differences for pretraining
maxidl opened this issue
Hi,
we are currently pretraining a 7b model on ~1T tokens, and our loss curve looks similar to the one in the MPT-7b blog post, figure 6.
I am wondering why in that curve the loss always stays above 2.0, while for other training runs (e.g. LLaMa-1, figure 1; LLaMa-2, figure 5) it quickly drops below 2.0.
I get that in the end the MPT model performs well, but can this difference in loss values be explained?
Potential reasons I can think of:
FSDP vs. other training frameworks (e.g. DeepSpeed; unknown for LLaMa)?
Any pointers appreciated.
The loss depends on the exact dataset and tokenizer used, so comparing loss values across training runs that use different datasets and different tokenizers is not meaningful.
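To make this concrete: per-token cross-entropy is only defined relative to a tokenizer, so a tokenizer that compresses text more aggressively (fewer tokens per byte) will show a higher per-token loss even when the model assigns the same total probability to the text. A tokenizer-independent comparison normalizes to bits per byte of raw text. A minimal sketch, using made-up numbers purely for illustration:

```python
import math

def bits_per_byte(loss_nats_per_token: float, tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text.

    total nats = loss * tokens; divide by bytes and by ln(2) to get bits/byte.
    """
    return loss_nats_per_token * (tokens / num_bytes) / math.log(2)

# Hypothetical runs over the same 4000-byte corpus:
# run A's tokenizer compresses more (1000 tokens), run B's less (1200 tokens).
bpb_a = bits_per_byte(2.1, tokens=1_000, num_bytes=4_000)
bpb_b = bits_per_byte(1.9, tokens=1_200, num_bytes=4_000)

# Despite the *higher* per-token loss, run A is actually the better
# compressor of the underlying text (lower bits per byte).
print(f"{bpb_a:.3f} vs {bpb_b:.3f}")
```

Under these assumed numbers, the run with loss 2.1 comes out ahead in bits per byte, which is exactly why raw loss values from runs with different tokenizers (and different datasets) cannot be compared directly.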