Mistral

Mistral: A strong and cool northwesterly wind that builds as it moves, bringing good health and clear skies.

A framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗. It includes tools and helpful scripts for incorporating new pre-training datasets, various schemes for single-node and distributed training (including on cloud providers like GCP), and, importantly, scripts for evaluation.

Visit our Read the Docs for the full documentation.

A Propulsion Endeavor 🚀

Community

Mistral is built to facilitate transparent and accessible training. To help reach this goal, we will hold community meetings twice a month, where we'll give updates on where we're at and what we're working on and, more importantly, hear from you about how we can help and possibly work together.

We would love for folks from academia, other community efforts, as well as those in industry to join - all are welcome. The first meeting will be on Monday, August 30th at 4 PM PT.

We'll post the future dates (and times - which we hope to move around through the day to maximally engage folks in varied timezones) after the first meeting!

Quickstart

Installation

The dependencies for Mistral can be installed using Conda. Note that the provided environment assumes that CUDA 11.0 is installed. You may need to adjust the environment YAML file depending on your setup.

git clone https://github.com/stanford-crfm/mistral.git
cd mistral
conda env create -f environments/environment-gpu.yaml  # Choose CUDA kernel based on the hardware!

If you are training on the CPU only, run conda env create -f environments/environment-cpu.yaml instead.

Training GPT-2 Micro

Prerequisites

First, update conf/tutorial-gpt2-micro.yaml with the directories where you want to store the Hugging Face cache and the model runs.

# Artifacts & Caching
artifacts:
    cache_dir: /path/to/artifacts
    run_dir: /path/to/runs

Next, make sure that /path/to/mistral is on your PYTHONPATH.
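If exporting PYTHONPATH in your shell is inconvenient (for example, in a notebook), a rough in-process equivalent is to add the repository root to sys.path before importing anything from Mistral. The helper below is a minimal sketch; `ensure_on_path` is a hypothetical name, not part of Mistral, and /path/to/mistral is the same placeholder used above.

```python
import os
import sys


def ensure_on_path(repo_root: str) -> bool:
    """Add repo_root to sys.path if it is missing.

    Returns True if the path was added, False if it was already present.
    Hypothetical helper for illustration; not part of Mistral.
    """
    root = os.path.abspath(repo_root)
    # Empty entries in sys.path mean "current directory"; skip them here.
    already_present = any(os.path.abspath(p) == root for p in sys.path if p)
    if already_present:
        return False
    sys.path.insert(0, root)
    return True


# Placeholder path: replace with the location of your clone.
added = ensure_on_path("/path/to/mistral")
print("added to sys.path" if added else "already on sys.path")
```

This only affects the current Python process; child processes started by a launcher still need PYTHONPATH set in the environment.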

Single-node single-GPU training

For single-node single-GPU training, run:

conda activate mistral
cd mistral
CUDA_VISIBLE_DEVICES=0 python train.py --config conf/tutorial-gpt2-micro.yaml --nnodes 1 --nproc_per_node 1 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 2 --run_id tutorial-gpt2-micro

Multi-node multi-GPU training with DeepSpeed

Modify /job/hostfile in the following way:

<Hostname of first machine> slots=<Number of GPUs>
<Hostname of second machine> slots=<Number of GPUs>
...
<Hostname of the nth machine> slots=<Number of GPUs>

Below is an example hostfile where we train on machine1 and machine2 with 8 GPUs each:

machine1 slots=8
machine2 slots=8
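Before launching, it can help to check that the hostfile agrees with the node and GPU counts you pass on the command line. The following is a minimal sketch that parses the `<hostname> slots=<n>` format shown above; `parse_hostfile` is a hypothetical helper, not part of Mistral or DeepSpeed, and it only understands the simple layout in this example.

```python
def parse_hostfile(text: str) -> dict:
    """Map each hostname to its slot count from `<host> slots=<n>` lines.

    Hypothetical helper for illustration; handles only the simple format
    shown in this README and raises on anything else.
    """
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("slots="):
            raise ValueError(f"malformed hostfile line: {raw!r}")
        value = parts[1][len("slots="):]
        if not value.isdigit():
            raise ValueError(f"slot count is not an integer: {raw!r}")
        hosts[parts[0]] = int(value)
    return hosts


example = "machine1 slots=8\nmachine2 slots=8\n"
hosts = parse_hostfile(example)
num_nodes = len(hosts)
total_gpus = sum(hosts.values())
print(num_nodes, total_gpus)  # 2 16
```

For the example hostfile this gives 2 nodes and 16 GPUs in total, that is, 8 per node.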

To start distributed training, run:

conda activate mistral
cd mistral
deepspeed --num_gpus 8 --num_nodes 2 --master_addr machine1 train.py --config conf/tutorial-gpt2-micro.yaml --nnodes 2 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z1-conf.json --run_id tutorial-gpt2-micro-multi-node > tutorial-gpt2-micro-multi-node.out 2> tutorial-gpt2-micro-multi-node.err

Note: You may need to adjust your batch size depending on the capacity of your GPUs.
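When adjusting the per-device batch size, it is worth keeping track of how many sequences are processed per optimizer step across all GPUs. The sketch below assumes standard data-parallel training, where each GPU processes its own batch; `grad_accum` is a hypothetical parameter standing in for any gradient accumulation your configuration applies, and the function itself is not part of Mistral.

```python
def effective_batch_size(per_device: int, gpus_per_node: int, nodes: int,
                         grad_accum: int = 1) -> int:
    """Sequences per optimizer step under plain data parallelism.

    Assumption: every GPU processes `per_device` sequences per forward and
    backward pass, and gradients are accumulated over `grad_accum` passes.
    """
    return per_device * gpus_per_node * nodes * grad_accum


# Single-node, single-GPU command above: batch size 2 on 1 GPU.
print(effective_batch_size(per_device=2, gpus_per_node=1, nodes=1))  # 2

# Multi-node command above: batch size 4 on 2 nodes with 8 GPUs each.
print(effective_batch_size(per_device=4, gpus_per_node=8, nodes=2))  # 64
```

If you halve the per-device batch size to fit in memory, doubling the accumulation steps keeps this product unchanged.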

If you are interested in training a model on Google Cloud, check out our Google Cloud + Kubernetes Tutorial.

Using the model

Model checkpoints are stored in the directory specified by artifacts.run_dir. For example, a checkpoint might be at /path/to/runs/tutorial-gpt2-micro/checkpoint-1000.
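To pick up the most recent checkpoint of a run without hard-coding the step, you can scan the run directory for the `checkpoint-<step>` naming pattern shown above. This is a minimal sketch; `latest_checkpoint` is a hypothetical helper, not part of Mistral, and the demo uses a temporary directory in place of a real run.

```python
import os
import re
import tempfile
from typing import Optional

_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(run_path: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(run_path):
        match = _CHECKPOINT_DIR.match(name)
        full_path = os.path.join(run_path, name)
        if match and os.path.isdir(full_path):
            step = int(match.group(1))  # compare numerically, not as strings
            if step > best_step:
                best_step, best_path = step, full_path
    return best_path


# Demo on a throwaway directory that mimics a run folder.
with tempfile.TemporaryDirectory() as run_path:
    for step in (500, 1000, 90):
        os.makedirs(os.path.join(run_path, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(run_path))
    print(found)  # checkpoint-1000
```

Steps are compared as integers because a plain string sort would rank checkpoint-90 above checkpoint-1000.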

Mistral stores model checkpoints in the Hugging Face format, so models can be loaded and used in the same manner as if one had trained the model with Hugging Face.

For instance, to generate text with 🤗 Transformers (you will need to clone the transformers repo):

conda activate mistral
cd transformers/examples/text-generation
python run_generation.py --model_type=gpt2 --model_name_or_path=/path/to/runs/tutorial-gpt2-micro/checkpoint-1000

Or to load the model in Python code (make sure /path/to/mistral is in your PYTHONPATH):

from src.models.mistral_gpt2 import MistralGPT2LMHeadModel

model = MistralGPT2LMHeadModel.from_pretrained("/path/to/runs/tutorial-gpt2-micro/checkpoint-1000")

Resources

The Propulsion team has trained 5 GPT-2 Medium models and 5 GPT-2 Small models on the OpenWebText corpus, as found in 🤗 datasets.

Checkpoints can be loaded as Hugging Face models. For each model, we provide checkpoints at 100k, 200k, 300k and 400k steps.

We have also stored over 600 checkpoints for each model, subject to the following checkpoint schedule:

Every 10 Steps, for the first 0 - 100 Steps.
Every 50 Steps, from 100 - 2000 Steps.
Every 100 Steps, from 2000 - 20,000 Steps.
Every 1000 Steps, from 20,000 - 400,000 Steps.

This comes out to 610 checkpoints per run, taking up ~22TB for all 10 models (making it pretty expensive to host!). If you are interested in acquiring these additional checkpoints, please file an issue or contact Laurel (lorr1) and Sidd (skaramcheti) at their @cs.stanford.edu email addresses, and we'll be happy to figure out a cost-effective way to share them.
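If you want to know which steps a given checkpoint request would cover, the schedule can be enumerated programmatically. The sketch below is one reading of the schedule, assuming each segment excludes its starting step and includes its ending step; that boundary handling is an assumption, not something the schedule spells out. Under this reading the count is 608, slightly below the 610 quoted above, so the actual runs may treat boundary steps (such as step 0) differently.

```python
def checkpoint_steps() -> list:
    """Enumerate checkpoint steps under one reading of the schedule.

    Assumption: each (start, end, every) segment saves at
    start + every, start + 2 * every, ..., end.
    """
    schedule = [
        (0, 100, 10),
        (100, 2000, 50),
        (2000, 20000, 100),
        (20000, 400000, 1000),
    ]
    steps = []
    for start, end, every in schedule:
        steps.extend(range(start + every, end + 1, every))
    return steps


steps = checkpoint_steps()
print(len(steps))            # 608 under the assumption above
print(steps[:3], steps[-1])  # [10, 20, 30] 400000
```

The four checkpoints linked in the tables below (100k, 200k, 300k and 400k steps) all appear in this list.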

GPT-2 Medium

Run	Type	Checkpoint	Size	Link
Arwen	GPT-2 Medium	400000	4.9G	download
Arwen	GPT-2 Medium	300000	4.9G	download
Arwen	GPT-2 Medium	200000	4.9G	download
Arwen	GPT-2 Medium	100000	4.9G	download
Beren	GPT-2 Medium	400000	4.9G	download
Beren	GPT-2 Medium	300000	4.9G	download
Beren	GPT-2 Medium	200000	4.9G	download
Beren	GPT-2 Medium	100000	4.9G	download
Celebrimbor	GPT-2 Medium	400000	4.9G	download
Celebrimbor	GPT-2 Medium	300000	4.9G	download
Celebrimbor	GPT-2 Medium	200000	4.9G	download
Celebrimbor	GPT-2 Medium	100000	4.9G	download
Durin	GPT-2 Medium	400000	4.9G	download
Durin	GPT-2 Medium	300000	4.9G	download
Durin	GPT-2 Medium	200000	4.9G	download
Durin	GPT-2 Medium	100000	4.9G	download
Eowyn	GPT-2 Medium	400000	4.9G	download
Eowyn	GPT-2 Medium	300000	4.9G	download
Eowyn	GPT-2 Medium	200000	4.9G	download
Eowyn	GPT-2 Medium	100000	4.9G	download

GPT-2 Small

Run	Type	Checkpoint	Size	Link
Alias	GPT-2 Small	400000	1.8G	download
Alias	GPT-2 Small	300000	1.8G	download
Alias	GPT-2 Small	200000	1.8G	download
Alias	GPT-2 Small	100000	1.8G	download
Battlestar	GPT-2 Small	400000	1.8G	download
Battlestar	GPT-2 Small	300000	1.8G	download
Battlestar	GPT-2 Small	200000	1.8G	download
Battlestar	GPT-2 Small	100000	1.8G	download
Caprica	GPT-2 Small	400000	1.8G	download
Caprica	GPT-2 Small	300000	1.8G	download
Caprica	GPT-2 Small	200000	1.8G	download
Caprica	GPT-2 Small	100000	1.8G	download
Darkmatter	GPT-2 Small	400000	1.8G	download
Darkmatter	GPT-2 Small	300000	1.8G	download
Darkmatter	GPT-2 Small	200000	1.8G	download
Darkmatter	GPT-2 Small	100000	1.8G	download
Expanse	GPT-2 Small	400000	1.8G	download
Expanse	GPT-2 Small	300000	1.8G	download
Expanse	GPT-2 Small	200000	1.8G	download
Expanse	GPT-2 Small	100000	1.8G	download

Issues

To ask questions, report issues, or request features, please use the GitHub Issue Tracker. Before creating a new issue, please make sure to search for existing issues that may solve your problem.

Contributing

Please see the following page for information on contributing.
