OPT-Benchmark

This benchmark compares the performance of Colossal-AI and DeepSpeed in terms of their zero redundancy optimizer (ZeRO) and offloading. The script is adapted from the Hugging Face example.

Run Benchmarking

First, you need to install the following libraries.

# assuming CUDA 11.3
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
pip install accelerate==0.10.0 datasets==1.18.4 transformers==4.21.0 deepspeed==0.6.5 tqdm
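
To check that the environment is set up, you can optionally print the installed versions (a minimal sanity check, not part of the benchmark):

# Optional sanity check: confirm the pinned libraries import and report their versions.
import torch, colossalai, deepspeed, transformers
print(torch.__version__, colossalai.__version__, deepspeed.__version__, transformers.__version__)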

To run the benchmarking with different acceleration libraries, you can execute the following bash scripts on a single node. We recommend running the run_opt_clm.sh script with one GPU first so as to download all the necessary files from Hugging Face.

# run with deepspeed zero 3 + offloading
bash ./run_opt_clm.sh

# run with the current version of colossal-ai zero module
bash ./run_opt_clm_colossalai.sh

# run with the newer (experimental) version of colossal-ai zero module
bash ./run_opt_clm_colossalai_new.sh
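
If you want to pre-populate the Hugging Face cache before a multi-GPU run, a minimal Python sketch is shown below; the checkpoint name facebook/opt-13b is an assumption, so substitute the variant you plan to benchmark.

# Sketch: warm the local Hugging Face cache so parallel workers do not
# all trigger the (very large) weight download at the same time.
# "facebook/opt-13b" is an assumed checkpoint name; adjust it to match MODEL.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-13b"
AutoConfig.from_pretrained(model_name)
AutoTokenizer.from_pretrained(model_name)
AutoModelForCausalLM.from_pretrained(model_name)  # downloads the full checkpoint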

Each script takes 4 arguments (an example invocation follows the list).

  • BS: batch size per GPU
  • MEMCAP: whether to limit the GPU memory usage. For example, if MEMCAP=40, the program will only use 40 GB of memory even if the GPU has 80 GB. If MEMCAP=0, there is no limit on the available GPU memory. The default value is 0.
  • MODEL: the variant of the OPT model; the default is 13B.
  • GPUNUM: the number of GPUs to use; the default is 8.
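
For example, assuming the four arguments are passed positionally in the order listed above (check the top of each script to confirm), a single-GPU OPT-13B run with batch size 24 and no memory cap might look like this:

# hypothetical invocation: BS=24, MEMCAP=0, MODEL=13b, GPUNUM=1
bash ./run_opt_clm_colossalai.sh 24 0 13b 1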

If you encounter OOM with Colossal-AI, please tune the parameters in colossalai_zero.py. More specifically, you could try to decrease the parameters warmup_non_model_data_ratio and gpu_margin_mem_ratio.

# try to decrease warmup_non_model_data_ratio and gpu_margin_mem_ratio
# if you encounter OOM error
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.7),
            optimizer_config=dict(gpu_margin_mem_ratio=0.8, initial_scale=2**8))
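
For reference, the lower-memory setting used for the OPT-13B batch size 32 run in the results below (warmup_non_model_data_ratio=0.4, gpu_margin_mem_ratio=0.5) can be expressed with the same structure; this is a sketch, with the other fields kept as above.

# Sketch: lower-memory variant of the config above, matching the
# warmup_non_model_data_ratio=0.4 / gpu_margin_mem_ratio=0.5 setting
# reported for the OPT-13B, batch size 32 run in the results below.
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.4),
            optimizer_config=dict(gpu_margin_mem_ratio=0.5, initial_scale=2**8))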

Test Results

We ran our code on the following hardware platform:

#GPUs per Node: 8
GPU: A100 (80 GB)
CPU Memory per Node: 1900 GB
#vCPU: 110
RDMA: Yes

The following are the results for Colossal-AI vs. DeepSpeed.

| #Nodes | #GPUs | Model | System | Policy | Batch Size per GPU | Global Batch Size | Step Time | Max Allocated | Max Reserved | Throughput (samples/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 24 | 24 | 51.74 | 32.38 | 77.31 | 0.463 |
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 32 | 32 | 64.88 | 41.88 | 72.65 | 0.493 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 24 | 24 | 41.50 | 71.05 | 77.06 | 0.578 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 32 | 32 | OOM | | | |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.4, gpu_margin_mem_ratio=0.5) | 32 | 32 | 51.50 | 72.73 | 77.21 | 0.621 |
| 1 | 1 | OPT-13B | Colossal-AI | cpu | 32 | 32 | 91.68 | 45.33 | 76.38 | 0.349 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 16 | 128 | 73.95 | 32.19 | 76.38 | 1.73 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 32 | 256 | 99.86 | 59.89 | 76.59 | 2.56 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 16 | 128 | 37.48 | 63.61 | 76.22 | 3.41 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 32 | 256 | OOM | | | |
| 1 | 8 | OPT-30B | Colossal-AI | cpu | 32 | 256 | 84.78 | 65.21 | 75.63 | 3.02 |
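
As a sanity check, the reported throughput matches the global batch size divided by the step time in seconds; for example, 128 / 73.95 ≈ 1.73 samples per second for the 8-GPU OPT-30B DeepSpeed run.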

About

License: Apache License 2.0

