FasterTransformer4CodeFuse

High-performance LLM inference based on our optimized version of FasterTransformer.

Introduction

This repo provides high-performance model inference, mainly supporting the CodeFuse model from Ant Group.

Compared with the original FasterTransformer (FT), this repo adds the following features:

  • ✅ Int8 quantization of the CodeFuse model
  • ✅ The prompt does not need to end on a complete word
  • ✅ Python API
  • ✅ Streaming output via the Python API
  • ✅ Faster model loading
  • ✅ Assorted bug fixes

Performance

Batch size: 1. Latency (ms) for CodeFuse 13B.

| Input / Output Length | Single A100, fp16 | Single A100, int8 | 2 × A100 (tensor parallelism), fp16 | 2 × A100 (tensor parallelism), int8 |
|---|---|---|---|---|
| 16 / 8 | 160 | 195 | 238 | 84 |
| 64 / 32 | 608 | 369 | 373 | 295 |
| 256 / 128 | 2650 | 1530 | 1492 | 1130 |
| 1024 / 512 | 10776 | 7054 | 6786 | 5415 |
| Tokens per sec | 48 | 75 | 77 | 98 |

(The last row reports throughput in tokens per second rather than latency.)

Getting Started

We run inside the nvcr.io/nvidia/pytorch:22.09-py3 container image.

1. Install requirements

pip install --no-cache-dir pybind11==2.6.2 transformers accelerate sentencepiece

echo "export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/" >> ~/.bashrc
export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/
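
If your Python installation is not under /opt/conda/lib/python3.8, you can ask pybind11 itself for the correct CMake directory instead of hard-coding the path (the helper below exists in pybind11 2.6 and later):

# Print the directory to use as pybind11_DIR (pybind11 >= 2.6).
import pybind11
print(pybind11.get_cmake_dir())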

2. Build

mkdir build ; cd build
export TORCH_PYTHON_LIBRARIES=/opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so
cmake -DCMAKE_BUILD_TYPE=Release -DSM="80;75" -DBUILD_PYT=ON -DSPARSITY_SUPPORT=OFF -DMEASURE_BUILD_TIME=ON \
      -DBUILD_CUTLASS_MIXED_GEMM=ON -DBUILD_MULTI_GPU=ON -DBUILD_TRT=OFF \
      -DENABLE_FP8=OFF -DBUILD_PYBIND=ON -DTORCH_PYTHON_LIBRARIES=${TORCH_PYTHON_LIBRARIES} ..
make -j"$(grep -c ^processor /proc/cpuinfo)"

3. Run

You can use the examples/pytorch/codefuse/huggingface_convert.py script to convert checkpoint files from HuggingFace format to FasterTransformer format.

export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2

python ../examples/pytorch/codefuse/huggingface_convert.py \
       -o ../models/${MODEL_NAME}/fastertransformer \
       -i ../models/${MODEL_NAME}/transformers \
       -infer_gpu_num ${TENSOR_PARA_SIZE} \
       -processes 20 \
       -weight_data_type fp16 \
       -model_name gptneox
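
For intuition, converting with -infer_gpu_num greater than 1 shards each weight matrix across the tensor-parallel ranks. The sketch below illustrates the general idea on toy tensors (column-parallel layers are split along the output dimension, row-parallel layers along the input dimension); it is not the script's actual code, and the shapes are made up.

# Generic illustration of tensor-parallel weight sharding; NOT the actual
# logic of huggingface_convert.py (the hidden size here is a toy value).
import torch

tensor_para_size = 2                     # matches TENSOR_PARA_SIZE above
hidden = 1024                            # toy hidden size

# Column-parallel layer (e.g. the first FFN projection): split the output dim.
w_in = torch.randn(hidden, 4 * hidden, dtype=torch.float16)
col_shards = torch.chunk(w_in, tensor_para_size, dim=1)

# Row-parallel layer (e.g. the second FFN projection): split the input dim.
w_out = torch.randn(4 * hidden, hidden, dtype=torch.float16)
row_shards = torch.chunk(w_out, tensor_para_size, dim=0)

# Each rank gets one shard, stored in the <TENSOR_PARA_SIZE>-gpu directory.
for rank in range(tensor_para_size):
    print(rank, tuple(col_shards[rank].shape), tuple(row_shards[rank].shape))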

You can use the examples/pytorch/codefuse/quant_and_save.py script to convert fp16 or fp32 FasterTransformer checkpoints into int8 weight files plus scales, which load faster and take less disk space.

export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2

python ../examples/pytorch/codefuse/quant_and_save.py \
       --in_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu \
       --out_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu_int8 \
       --lib_path ../build/lib/libth_common.so \
       --tensor_para_size ${TENSOR_PARA_SIZE} \
       --use_gptj_residual \
       --data_type fp16
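
As a rough illustration of what "int8 files and scales" means: weight-only int8 quantization typically stores an int8 tensor plus one scale per output channel, and dequantizes (or fuses the scale into the GEMM) at inference time. The sketch below shows that idea on a toy weight; the exact scheme used by quant_and_save.py may differ in detail.

# Rough sketch of weight-only int8 quantization with per-channel scales;
# the scheme actually used by quant_and_save.py may differ in detail.
import torch

w_fp16 = torch.randn(1024, 4096, dtype=torch.float16)   # toy weight

# Symmetric per-output-channel scales.
scales = w_fp16.abs().amax(dim=1, keepdim=True).float() / 127.0
w_int8 = (w_fp16.float() / scales).round().clamp(-127, 127).to(torch.int8)

# At inference time the int8 weight is dequantized (or the scale is fused
# into the GEMM) to recover an approximation of the original fp16 weight.
w_back = (w_int8.float() * scales).to(torch.float16)
print("max quantization error:", (w_fp16 - w_back).abs().max().item())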

You can use the examples/pytorch/codefuse/codefuse_example.py script to run model inference.

export MODEL_NAME=codefuse

# fp16 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
       --ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu \
       --tokenizer_path ../models/${MODEL_NAME}/transformers

# int8 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
       --ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu_int8 \
       --tokenizer_path ../models/${MODEL_NAME}/transformers \
       --int8_mode 1 \
       --enable_int8_weights 1

# fp16 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
         --world_size 2 \
         --ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu \
         --tokenizer_path ../models/${MODEL_NAME}/transformers

# int8 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
         --world_size 2 \
         --ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu_int8 \
         --tokenizer_path ../models/${MODEL_NAME}/transformers \
         --int8_mode 1 \
         --enable_int8_weights 1
