mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm


Train using RefinedWeb dataset instead of C4

maxilie opened this issue · comments

When training the next generation MPT base model, I suggest you swap out the C4 dataset for the RefinedWeb dataset.

The filtering methods used to build RefinedWeb have produced more capable models (including the Falcon models) than models like MPT that were trained on the C4 dataset.

Admittedly, the C4 dataset appears to be almost as good as RefinedWeb: Table 4 of the RefinedWeb paper shows that training on RefinedWeb improves model performance by less than 1% over C4, as judged by their multi-domain evaluation set.

Still, RefinedWeb seems to be the superior dataset, and I think that the MPT models can get a bit closer to the performance of the Falcon models by using it.
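In case it is useful, here is a rough, untested sketch of how RefinedWeb could be converted into the MDS shard format that foundry's streaming dataloaders consume, along the lines of the existing C4 conversion flow. The dataset ID, the `content` field name, the sample cap, and the output path are my assumptions for illustration, not anything from the foundry scripts:

```python
# Rough sketch (untested): convert RefinedWeb from HuggingFace into MDS shards
# for mosaicml-streaming. Dataset ID, the 'content' field name, the sample cap,
# and the output path are all assumptions for illustration.
from datasets import load_dataset
from streaming import MDSWriter

NUM_SAMPLES = 10_000  # small slice just to illustrate; drop the cap for a real run

# Stream the dataset so nothing has to be downloaded up front.
ds = load_dataset('tiiuae/falcon-refinedweb', split='train', streaming=True)

# A raw-text 'text' column, which a text dataloader can tokenize downstream.
columns = {'text': 'str'}

with MDSWriter(out='refinedweb-mds/train', columns=columns, compression='zstd') as writer:
    for i, sample in enumerate(ds):
        if i >= NUM_SAMPLES:
            break
        writer.write({'text': sample['content']})
```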

Hi @maxilie!

We are, of course, always interested in improving and diversifying our datasets and sources.

> Still, RefinedWeb seems to be the superior dataset, and I think that the MPT models can get a bit closer to the performance of the Falcon models by using it.

It is worth pointing out that MPT-7B was trained for 1T tokens, while Falcon-7B was trained for 1.5T tokens. We also recently released an updated MPT-7B with an 8k context length that was trained for an additional 500B tokens. As measured by our model gauntlet, the new MPT-7B-8k is superior in all but one competency category (commonsense_reasoning). You can read about that model here.

[Screenshot, 2023-07-31: model gauntlet category scores for MPT-7B-8k]

You can also check out our eval page to see how other models perform in these categories.

Thanks, @maxilie, for the suggestion! While our foundry examples use C4, the MPT model itself was trained on our own curated dataset mixture; see the data section here: https://www.mosaicml.com/blog/mpt-7b
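Not official guidance, just a sketch: once MDS shards like the ones above exist, they can be read back with the `StreamingDataset` class from mosaicml-streaming that foundry's text dataloader builds on. The paths and batch size below are placeholders:

```python
# Rough sketch (untested): read converted shards back with mosaicml-streaming.
# 'refinedweb-mds/train' and the batch size are placeholders; a real training
# dataloader would add tokenization and sequence concatenation on top of this.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(local='refinedweb-mds/train', shuffle=True, batch_size=8)
loader = DataLoader(dataset, batch_size=8)

batch = next(iter(loader))
print(batch['text'][:2])  # a couple of raw text samples, ready for tokenization
```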