mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm


Train using RefinedWeb dataset instead of C4

maxilie opened this issue · comments

When training the next generation MPT base model, I suggest you swap out the C4 dataset for the RefinedWeb dataset.

The filtering methods used to build RefinedWeb have produced more capable models (including the Falcon models) than models like MPT that were trained on the C4 dataset.

Admittedly, the C4 dataset appears to be almost as good as RefinedWeb: Table 4 of the RefinedWeb paper shows that training on RefinedWeb improves model performance by less than 1% over C4, as judged by their multi-domain evaluation set.

Still, RefinedWeb seems to be the superior dataset, and I think that the MPT models can get a bit closer to the performance of the Falcon models by using it.
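In case it is useful, here is a rough, untested sketch of how RefinedWeb could be converted into the MDS shard format that foundry's streaming dataloaders consume, along the lines of the existing C4 conversion flow. The dataset ID, the `content` field name, the sample cap, and the output path are my assumptions for illustration, not anything from the foundry scripts:

```python
# Rough sketch (untested): convert RefinedWeb from HuggingFace into MDS shards
# for mosaicml-streaming. Dataset ID, the 'content' field name, the sample cap,
# and the output path are all assumptions for illustration.
from datasets import load_dataset
from streaming import MDSWriter

NUM_SAMPLES = 10_000  # small slice just to illustrate; drop the cap for a real run

# Stream the dataset so nothing has to be downloaded up front.
ds = load_dataset('tiiuae/falcon-refinedweb', split='train', streaming=True)

# A raw-text 'text' column, which a text dataloader can tokenize downstream.
columns = {'text': 'str'}

with MDSWriter(out='refinedweb-mds/train', columns=columns, compression='zstd') as writer:
    for i, sample in enumerate(ds):
        if i >= NUM_SAMPLES:
            break
        writer.write({'text': sample['content']})
```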

Hi @maxilie!

We are, of course, always interested in improving and diversifying our datasets and sources.

> Still, RefinedWeb seems to be the superior dataset, and I think that the MPT models can get a bit closer to the performance of the Falcon models by using it.

It is worth pointing out that MPT-7B was trained for 1T tokens, while Falcon-7B was trained for 1.5T tokens. We also recently released an updated MPT-7B with an 8k context length that was trained for an additional 500B tokens. As measured by our model gauntlet, the new MPT-7B-8k is superior in all but one competency category (commonsense_reasoning). You can read about that model here.

[Screenshot, 2023-07-31: model gauntlet category scores for MPT-7B-8k]

You can also check out our eval page to see how other models perform in these categories.

Thanks, @maxilie, for the suggestion! While our foundry examples use C4, the MPT model itself was trained on our own curated dataset mixture; see the data section here: https://www.mosaicml.com/blog/mpt-7b
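Not official guidance, just a sketch: once MDS shards like the ones above exist, they can be read back with the `StreamingDataset` class from mosaicml-streaming that foundry's text dataloader builds on. The paths and batch size below are placeholders:

```python
# Rough sketch (untested): read converted shards back with mosaicml-streaming.
# 'refinedweb-mds/train' and the batch size are placeholders; a real training
# dataloader would add tokenization and sequence concatenation on top of this.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(local='refinedweb-mds/train', shuffle=True, batch_size=8)
loader = DataLoader(dataset, batch_size=8)

batch = next(iter(loader))
print(batch['text'][:2])  # a couple of raw text samples, ready for tokenization
```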