Replicating the training/test split used for the models
Uzay-G opened this issue
Hello,
We are running some experiments on the Mistral models, and it would be useful to know how the OpenWebText train/test split was done when training them. That would let us replicate the split and evaluate the models on OpenWebText without test-set leakage.
Thanks for your help.
@siddk would know better, but my first guess is that the code in auto.py performed the split, so ultimately the HF Datasets method train_test_split with a validation ratio of 0.0005. I'm guessing this was done once and every random-seed experiment used the same split, but I'm unsure which random seed was used for the initial data processing.
Code in Mistral: src/corpora/auto.py, line 112 at commit 315560f
Code in HF Datasets: Dataset.train_test_split
If I had to guess, I would assume it was done with seed=42, but that could certainly be wrong; I just note that 42 is the default seed when no seed is specified.
Honestly, I'm really unclear on what random seed was used for the data preprocessing, which makes it difficult to perfectly replicate the data split.
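To illustrate why that matters: the split is fully determined by the seed passed to train_test_split, so without the original seed the exact split can't be recovered. A quick self-contained check (a sketch on a tiny dummy dataset, not OpenWebText):

```python
from datasets import Dataset

# Tiny stand-in corpus; the same logic applies to OpenWebText.
data = Dataset.from_dict({"text": [f"doc {i}" for i in range(1000)]})

# Same seed -> identical split; different seed -> a different split.
a = data.train_test_split(test_size=0.1, seed=42)
b = data.train_test_split(test_size=0.1, seed=42)
c = data.train_test_split(test_size=0.1, seed=21)

assert a["test"]["text"] == b["test"]["text"]  # reproducible given the seed
assert a["test"]["text"] != c["test"]["text"]  # changing the seed changes the split
```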
Here are some more details from @siddk:
All OpenWebText data was processed once, via a call to get_auto_dataset (https://github.com/stanford-crfm/mistral/blob/main/src/corpora/auto.py#L94) using the first model’s config (alias-gpt2-small) with seed = 21.
This all happened on a single node; the remaining part of the config that's important is here: https://github.com/stanford-crfm/mistral/blob/main/conf/datasets/openwebtext.yaml.
Basically: 64 workers for training, 4 workers for eval, and a validation ratio of 0.0005.
If you just run train.py from a single process and point it at the mistral-small.yaml config, it should be equivalent.
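Putting those details together, a minimal sketch of replicating the split outside Mistral's pipeline (the Hub dataset name "openwebtext" and the use of num_proc for the reported worker counts are assumptions on my part; the canonical path is get_auto_dataset with the linked config):

```python
from datasets import load_dataset

# Load the raw OpenWebText corpus (Hub name assumed to be "openwebtext").
dataset = load_dataset("openwebtext", split="train")

# Reproduce the reported split: 0.05% validation, seed 21.
split = dataset.train_test_split(test_size=0.0005, seed=21)
train_set, val_set = split["train"], split["test"]

# The reported 64 train / 4 eval workers presumably apply to the
# downstream preprocessing, e.g. tokenization via Dataset.map:
# train_set = train_set.map(tokenize_fn, batched=True, num_proc=64)
# val_set = val_set.map(tokenize_fn, batched=True, num_proc=4)
```

Note that OpenWebText has no predefined validation split on the Hub, so the "train" split above is the full corpus being divided.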