stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.


Replicating training / test split on models

Uzay-G opened this issue · comments

Hello,
We are running some experiments on Mistral models, and it would be useful to know how the OpenWebText train-test split was done when the models were trained. That would allow us to replicate the split and evaluate the models on OpenWebText without leakage.
Thanks for your help.

commented

@siddk would know better, but my first guess is that the code in auto.py performed the split, ultimately calling the HF datasets method train_test_split with a validation ratio of 0.0005. I'm guessing this was done once and every random-seed experiment used the same split, but I am unsure which random seed was used for the initial data processing.

code in Mistral:

if "validation" not in dataset:

code in HF Datasets:

https://github.com/huggingface/datasets/blob/6d247bd4fd76b45998747ecc3367daab5f5e0b82/src/datasets/arrow_dataset.py#L3645
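For reference, the index logic behind `train_test_split` at that link can be sketched in a few lines: a seeded NumPy generator permutes the row indices, and the first `n_test` of them become the test (here, validation) set. This is a simplified paraphrase rather than the library's exact code; the ceil-style rounding and slicing order are assumptions to verify against the `datasets` version actually used.

```python
import math
import numpy as np

def sketch_train_test_split(n_samples, test_size, seed):
    """Approximate the index selection of datasets.Dataset.train_test_split.

    Assumptions (verify against your datasets version): the test count is
    ceil(test_size * n_samples), and shuffling uses a NumPy generator
    seeded with `seed`.
    """
    n_test = math.ceil(test_size * n_samples)
    permutation = np.random.default_rng(seed).permutation(n_samples)
    test_indices = permutation[:n_test]
    train_indices = permutation[n_test:]
    return train_indices, test_indices

# With the ratio discussed in this thread (0.0005), a 1M-row dataset
# would yield 500 held-out rows.
train_idx, test_idx = sketch_train_test_split(1_000_000, 0.0005, seed=21)
```

The key point for replication is that the split depends only on the row count, the ratio, and the seed, so the same three inputs reproduce the same index sets.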

commented

If I had to guess, I would assume it was done with seed=42, though that could certainly be wrong; I just note that 42 is the default seed when no seed is specified.

commented

Honestly, I am really unclear on which random seed was used for the data preprocessing, which makes it difficult to perfectly replicate the data split.

commented

Here are some more details from @siddk:

All OpenWebText data was processed once, via a call to get_auto_dataset (https://github.com/stanford-crfm/mistral/blob/main/src/corpora/auto.py#L94) using the first model’s config (alias-gpt2-small) with seed = 21.
This all happened on a single node, the remaining part of the config that’s important is here: https://github.com/stanford-crfm/mistral/blob/main/conf/datasets/openwebtext.yaml.
Basically: 64 workers for training, 4 workers for eval, and a validation ratio of 0.0005.
If you just run train.py from a single process and point it at the mistral-small.yaml config, the result should be equivalent.