mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm


Loss curve differences for pretraining

maxidl opened this issue · comments

Hi,
we are currently pretraining a 7B model on ~1T tokens, and our loss curve looks similar to the one in the MPT-7B blog post (figure 6).
I am wondering why the loss value in that curve always stays above 2.0, while for other training runs (e.g. LLaMA-1, figure 1; LLaMA-2, figure 5) it quickly drops below 2.0.
I understand that the MPT model performs well in the end, but can this difference in loss values be explained?

Potential reasons I can think of:
FSDP vs. other training frameworks (e.g. DeepSpeed; the setup used for LLaMA is unknown)?

Any pointers appreciated.

The loss depends on the exact dataset and tokenizer used, so comparing loss values across training runs that use different datasets and different tokenizers is not meaningful.
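To see why the tokenizer matters: the reported loss is an average cross-entropy *per token*, and different tokenizers split the same text into different numbers of tokens, so the per-token averages are not directly comparable. A tokenizer-independent alternative is bits-per-byte. Below is a minimal sketch (the function name and the numeric values are illustrative assumptions, not from any actual training run) showing how two runs with different per-token losses can correspond to nearly the same bits-per-byte:

```python
import math

def bits_per_byte(loss_per_token_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert an average per-token cross-entropy (in nats) into
    bits-per-byte, a tokenizer-independent compression metric:
    total nats over the corpus, divided by ln(2) * corpus size in bytes."""
    total_nats = loss_per_token_nats * n_tokens
    return total_nats / (math.log(2) * n_bytes)

# Hypothetical numbers: the same 1 MB of text tokenized two different ways.
# Tokenizer A produces more tokens, so its per-token loss is lower even
# though both models compress the underlying text about equally well.
n_bytes = 1_000_000
bpb_a = bits_per_byte(loss_per_token_nats=1.8, n_tokens=300_000, n_bytes=n_bytes)
bpb_b = bits_per_byte(loss_per_token_nats=2.2, n_tokens=245_000, n_bytes=n_bytes)
```

Here `bpb_a` and `bpb_b` come out nearly equal (~0.78) despite the per-token losses differing by 0.4, which is the kind of gap the question is asking about.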