⚡️🤗LitHFT

Pretrain, finetune any LLMs from huggingface on your own data.

| English | 中文简体 |

What is LitHFT

LitHFT is a tool built on top of LitGPT for pretraining and finetuning large language models (LLMs) from Hugging Face.

By comparison, LitGPT is designed around a fixed set of architectures and supports 20+ commonly used LLMs, but it cannot be used with other Hugging Face models, such as Qwen, without adding explicit support. LitHFT works with native Hugging Face models directly and requires no checkpoint conversion; the trade-off is fewer training optimizations.

| Comparison | LitHFT | LitGPT |
| --- | --- | --- |
| LLMs | Any | 20+ |
| Optimization | DeepSpeed | FSDP |
| Dataloader | Packed data from TinyLlama | litdata |

Install LitHFT

git clone https://github.com/DoubleVII/lithft.git
cd lithft
pip install -e .

Warning

This project has only been tested on Qwen and Mistral.

| Model | Device | Throughput (tokens / s / device) |
| --- | --- | --- |
| Qwen1.5-1.8B | A800-40G | 24.5k |
| Mistral-7B | A100-80G | 3.5k |

Quick start

Finetune

Data

We use ParquetData as an example here; the parquet file contains two columns, prompt and response. You can use any data module supported by LitGPT.
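If your data is not yet in this format, a minimal sketch of producing a compatible parquet file with pandas could look like the following (the example rows and the pandas/pyarrow dependency are illustrative assumptions, not part of lithft):

# Minimal sketch: write a parquet file with `prompt` and `response` columns.
# Requires pandas and pyarrow (or fastparquet); the rows are placeholders.
import pandas as pd

rows = [
    {"prompt": "Translate to French: Hello, world!", "response": "Bonjour, le monde !"},
    {"prompt": "What is 2 + 2?", "response": "4"},
]
pd.DataFrame(rows).to_parquet("sft_data.parquet", index=False)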

Launch training

huggingface-cli download "Qwen/Qwen1.5-1.8B" --local-dir Qwen1.5-1.8B --local-dir-use-symlinks False

MODEL_DIR=Qwen1.5-1.8B
DATA_PATH=sft_data.parquet # columns: `prompt` and `response`

fabric run model \
--accelerator=cuda \
--devices=8 \
launch/finetune.py \
--checkpoint_dir $MODEL_DIR \
--data ParquetData \
--data.data_path $DATA_PATH \
--train.learning_rate=1e-5 \
--train.lr_warmup_steps=100 \
--train.micro_batch_size=16 \
--train.epochs=1 \
--train.save_interval=10000 \
--train.global_batch_size=64 \
--train.log_interval=1 \
--out_dir out

Convert to Huggingface

The model's state dict is not changed during training; the conversion script simply extracts it:

python3 litgpt/scripts/convert_hf_fast.py out/final/lit_model.pth/checkpoint/mp_rank_00_model_states.pt Qwen1.5-1.8B-finetuned

The saved model file is Qwen1.5-1.8B-finetuned/pytorch_model.bin. You may need to copy some metadata files (e.g. the config and tokenizer files) from the original model directory into Qwen1.5-1.8B-finetuned so that the model can be loaded directly via AutoModelForCausalLM.
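As a quick check, the converted checkpoint can then be loaded with the standard transformers API. This is a sketch of ordinary Hugging Face usage, not a lithft interface; it assumes the config and tokenizer files have been copied into Qwen1.5-1.8B-finetuned as described above:

# Minimal sketch: load the converted model and generate a few tokens.
# Assumes Qwen1.5-1.8B-finetuned contains pytorch_model.bin plus the config
# and tokenizer files copied from the original Qwen1.5-1.8B directory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "Qwen1.5-1.8B-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))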

Pretrain

Data

For pretraining, LitHFT uses the packed-data dataloader from TinyLlama (with some modifications). Use the following script to process all parquet files under pretrain_data into binary data. Every parquet file must contain a text column (a sketch of producing such a file follows the commands below).

huggingface-cli download "Qwen/Qwen1.5-1.8B" --local-dir Qwen1.5-1.8B --local-dir-use-symlinks False

python scripts/prepare_packed_data.py --source_path pretrain_data --destination_path bin/pretrain_data --tokenizer_path Qwen1.5-1.8B --prefix data_part
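If you still need to assemble the input files, here is a minimal sketch of writing a parquet file with the required text column; run it before the preparation script above. The directory layout matches the command above, but the file name, example documents, and pandas/pyarrow dependency are illustrative assumptions:

# Minimal sketch: write a parquet file with a `text` column into pretrain_data/.
# The file name and documents are placeholders; any parquet files with a
# `text` column (one document per row) under pretrain_data/ will be processed.
from pathlib import Path

import pandas as pd

Path("pretrain_data").mkdir(exist_ok=True)
docs = [
    {"text": "First plain-text document of the pretraining corpus."},
    {"text": "Second plain-text document of the pretraining corpus."},
]
pd.DataFrame(docs).to_parquet("pretrain_data/corpus_000.parquet", index=False)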

Launch distributed training

On the rank-0 node, start training with the following command; on the other nodes, change the value of NODE_RANK accordingly (RANK0_ADDR and RANK0_PORT should point to the rank-0 node).

NODE_RANK=0
MODEL_DIR=Qwen1.5-1.8B
DATA_PATH=bin/pretrain_data

fabric run model \
--node-rank=$NODE_RANK \
--main-address=$RANK0_ADDR \
--main-port=$RANK0_PORT \
--accelerator=cuda \
--devices=8 \
--num-nodes=2 \
launch/pretrain.py \
--model_config $MODEL_DIR \
--data PackedData \
--data.data_path $DATA_PATH \
--data.shuffle False \
--data.file_prefixes data_part \
--train.learning_rate 4e-4 \
--train.lr_warmup_steps=200 \
--train.micro_batch_size=3 \
--train.max_tokens=1000000000000 \
--train.save_interval=10000 \
--train.log_interval=1 \
--zero3=False \
--out_dir out

Convert to Huggingface

The conversion is the same as in the finetuning section.

Continue training

Continued training can be performed by loading a pretrained model; specify the model directory with the initial_checkpoint_dir flag:

NODE_RANK=0
MODEL_DIR=Qwen1.5-1.8B
DATA_PATH=bin/pretrain_data

fabric run model \
--node-rank=$NODE_RANK \
--main-address=$RANK0_ADDR \
--main-port=$RANK0_PORT \
--accelerator=cuda \
--devices=8 \
--num-nodes=2 \
launch/pretrain.py \
--initial_checkpoint_dir $MODEL_DIR \
--data PackedData \
--data.data_path $DATA_PATH \
--data.shuffle False \
--data.file_prefixes data_part \
--train.learning_rate 4e-4 \
--train.lr_warmup_steps=200 \
--train.micro_batch_size=3 \
--train.max_tokens=1000000000000 \
--train.save_interval=10000 \
--train.log_interval=1 \
--zero3=False \
--out_dir out

License

LitHFT is released under the MIT license. Additionally, this project includes code from LitGPT, which is distributed under the Apache 2.0 license.
