The is an example of using megatron to pretrain GPT2
git clone https://github.com/wangcaihua/megatron_gpt.git
git remote set-url origin https://<githubtoken>@github.com/wangcaihua/megatron_gpt.git
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
ln -s ~/miniconda3/lib ~/miniconda3/lib64
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
nvcc --version
conda install -c nvidia cuda-nvprof=11.7
- download and install miniconda
- using
conda
install pytorch, this will install cuda-11.7 - install
cuda-nvprof
to ensure nvprof is 11.7
python -c "import torch; print(torch.__version__)"
# out: 1.13.1+cu117 or 1.13.1
that means your torch
version is 1.13.1, and your required cuda version is 11.7. However, you real cuda version may not be that one. To get the cuda version in you machine,
nvcc -V
If you get a version like cuda compilation tools, release 11.7, V11.7.64
, your version of cuda and torch are match. It not, change you cuda version url. Here is an example:
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run
rm cuda_11.7.0_515.43.04_linux.run
First, since megatron
depend on JIT, Ninja is required to work through:
sudo apt-get -y install ninja-build
pip install Ninja
Second, install apex, which enable you mixed precision and distributed training in Pytorch:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd .. && rm -rf apex
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout 285068c8108e0e8e6538f54fe27c3ee86c5217a2
git apply ../megatron.patch
python setup.py bdist_wheel
cd dist
pip install megatron-*
cd ../.. && rm -rf Megatron-LM
Get vocab.json
and merges.txt
from huggingface, and rename to:
- vocab.json -> 'gpt2-vocab.json'
- merges.txt -> 'gpt2-merges.txt'
Download data from openai:
python download_dataset.py --data_path=gpt2_data
Convert json
data to IndexDataset:
bash preprocess_gpt.sh
The content of preprocess_gpt.sh
is as following:
python preprocess_data.py \
--input gpt2_data/webtext.train.jsonl \
--output-prefix gpt2_data/my-gpt2 \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--append-eod \
--workers 4 \
--chunk-size 128
Before you run preprocess_gpt.sh
, make sure you change the input path
# BERT-345M-uncased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
# BERT-345M-cased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
GPT-345M: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir ckpt
mv megatron_lm_345m_v0.0.zip ckpt
cd ckpt && unzip megatron_lm_345m_v0.0.zip && rm -rf megatron_lm_345m_v0.0.zip
bash gpt_pretrain.sh
bash run_mpi.sh