This repository contains a minimal example of fine-tuning autoregressive language models (e.g. Llama 2) on a language modeling task without LoRA. By default, it uses Hugging Face's `accelerate` and DeepSpeed ZeRO-3 to fit large models on multiple GPUs (two in this setup). FlashAttention-2 and `torch.bfloat16` are also used.
One benefit of this setup is that the training code itself is independent of both multi-GPU execution and the memory optimizations provided by DeepSpeed. This means you can adapt it to your own experiments without changing the code (as long as you use the Hugging Face Trainer); the only modification you need to make is to the launch command, which is provided in the Usage section below.
The code has only been tested on a machine with two NVIDIA A100 GPUs. If you have a different setup, you may need to adjust the code accordingly.
Make sure to install the required packages by running:
```
python -m pip install -r requirements.txt
```
There are two configuration files that you need to provide: `accelerate_config.yaml` and `deepspeed_config.json`, both of which are available in this repository. They are configured in a way that should work out of the box, but you should still check them to make sure they match your setup.
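As a rough illustration, an `accelerate_config.yaml` for a two-GPU DeepSpeed setup might look like the sketch below. This is an assumption-laden example (machine layout, flags), not necessarily the exact file shipped in this repository:

```yaml
# Sketch of an accelerate config for 2 GPUs with a DeepSpeed JSON config.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed_config.json
  zero3_init_flag: true
machine_rank: 0
num_machines: 1
num_processes: 2   # one process per GPU
use_cpu: false
```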
The `accelerate_config.yaml` configuration file can also be obtained by running `accelerate config` from the command line. This starts an interactive prompt that guides you through creating a configuration file. For the DeepSpeed configuration, please refer to the DeepSpeed documentation for a comprehensive list of options.
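For orientation, a minimal ZeRO-3 DeepSpeed configuration often looks roughly like the following. Treat it as a sketch (the `"auto"` values are filled in by the Trainer/Accelerate integration), not the exact `deepspeed_config.json` in this repository:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```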
To fine-tune the model, run the following command:
```
accelerate launch \
    --config_file accelerate_config.yaml \
    --use_deepspeed \
    --deepspeed_config_file deepspeed_config.json \
    finetune.py \
    --dataset_name wikitext \
    --subset_name wikitext-2-raw-v1 \
    --model_name meta-llama/Llama-2-7b-hf \
    --num_epochs 5 \
    --lr 2e-5 \
    --batch_size 4
```
- Do not use `AutoModelForCausalLM.from_pretrained(..., device_map="auto")` or `model.to(device)` to move the model to a specific device. Accelerate will handle device placement for you.
- I am using the `paged_adamw_32bit` optimizer here, because the regular `adamw_hf` won't fit on two GPUs. Note that the paged version can be slower, so feel free to check whether other optimizers work for you. A minimal sketch of a Trainer-based training script that follows both notes is shown below.
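The sketch below illustrates what a Trainer-based `finetune.py` could look like under these constraints. It is an illustration under assumptions (argument handling, sequence length, output paths, and dataset preprocessing are made up here), not the exact script in this repository:

```python
# Sketch: device-agnostic fine-tuning with the Hugging Face Trainer.
# Launch with `accelerate launch ...`; no .to(device) or device_map needed.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# Note: no device_map="auto" and no model.to(device);
# Accelerate/DeepSpeed decide where the parameters live.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    bf16=True,
    optim="paged_adamw_32bit",  # regular AdamW does not fit on two GPUs here
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```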