mrshenli/fsdp_1T

Run fairseq-train with SLURM srun/sbatch

Create conda env

conda create -yn fsdp_1T python=3.8
conda activate fsdp_1T

Check out and build PyTorch from source (for EFA support, see the instructions)

conda install -y astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
conda install -y -c pytorch magma-cuda110
git clone --recursive git@github.com:pytorch/pytorch.git
cd pytorch
TORCH_CUDA_ARCH_LIST=8.0 python setup.py install
cd ..
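
Before moving on, it can be worth a quick sanity check that the build picked up CUDA (TORCH_CUDA_ARCH_LIST=8.0 targets A100 GPUs); the fi_info line assumes the EFA/libfabric tooling from the EFA instructions is installed:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
fi_info -p efa  # lists EFA interfaces if the libfabric/EFA stack is present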

Clone and install pbelevich/fairscale from source

git clone git@github.com:pbelevich/fairscale.git pbelevich-fairscale
cd pbelevich-fairscale
pip install -e .
cd ..
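
Optionally, verify that FSDP imports from this checkout (this assumes the fork exposes FullyShardedDataParallel under fairscale.nn, as upstream fairscale does):

python -c "from fairscale.nn import FullyShardedDataParallel; print('FSDP import OK')"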

Clone and install pbelevich/fairseq from branch fsdp_1T

git clone -b fsdp_1T git@github.com:pbelevich/fairseq.git pbelevich-fairseq
cd pbelevich-fairseq
pip install -e .
cd ..
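
A quick check that the editable install is the one on the path:

fairseq-train --help | head -n 5
python -c "import fairseq; print(fairseq.__file__)"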

Install deepspeed

pip install deepspeed
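
DeepSpeed installs a ds_report utility that summarizes which of its ops are compatible with the current CUDA/compiler setup; running it is a convenient sanity check:

ds_report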

Clone and build NVIDIA/apex

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
cd ..
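
To confirm the optional extensions actually built, try importing them (the extension module names here are assumptions based on apex's setup.py and may change between versions):

python -c "import apex, amp_C, fused_adam_cuda, xentropy_cuda; print('apex extensions OK')"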

[Not needed if you use the fsdp_1T branch of pbelevich/fairseq] Quick fix for a fairseq/DeepSpeed incompatibility: open fairseq/optim/cpu_adam.py and append ", False" to line 116

Clone this repo

git clone https://github.com/pbelevich/fsdp_1T.git
cd fsdp_1T

Preprocess the data for RoBERTa

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
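
If preprocessing succeeded, data-bin/wikitext-103 should contain dict.txt plus .bin/.idx pairs for the train, valid, and test splits:

ls data-bin/wikitext-103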

Run fairseq-train with SLURM sbatch (output to the file slurm-XXXXX.out)

sbatch fairseq_fsdp_sbatch.sh
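
To confirm the job was accepted and see where it is running (XXXXX is the job ID printed by sbatch):

squeue -u $USER
sacct -j XXXXX --format=JobID,State,Elapsed,NodeList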

To see the log:

tail -f -n +1 slurm-XXXXX.out
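
To pull only the training-progress lines out of a long log (this assumes fairseq's default progress logging, which reports a loss value on every logged update):

grep loss slurm-XXXXX.out | tail -n 5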

Run fairseq-train with SLURM srun (output to the screen)

./fairseq_fsdp_interactive.sh
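
The interactive script wraps fairseq-train in srun. As a rough sketch of the usual pattern (the actual flags live in fairseq_fsdp_interactive.sh; the rank/rendezvous wiring below is only an assumption about how such scripts are commonly written):

# each srun task can derive its distributed rank and rendezvous address
# from SLURM's environment variables:
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export RANK=$SLURM_PROCID
export WORLD_SIZE=$SLURM_NTASKS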
