OPT-Benchmark

This benchmark compares the performance of Colossal-AI and DeepSpeed in terms of their zero redundancy optimizer (ZeRO) and offloading. The script is adapted from the Hugging Face example.

Run Benchmarking

First, you need to install the following libraries.

# assuming CUDA 11.3
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
pip install accelerate==0.10.0 datasets==1.18.4 transformers==4.21.0 deepspeed==0.6.5 tqdm
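
To check that the environment is set up, you can optionally print the installed versions (a minimal sanity check, not part of the benchmark):

# Optional sanity check: confirm the pinned libraries import and report their versions.
import torch, colossalai, deepspeed, transformers
print(torch.__version__, colossalai.__version__, deepspeed.__version__, transformers.__version__)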

To run the benchmarking with different acceleration libraries, you can execute the following bash scripts on a single node. We recommend running the run_opt_clm.sh script with one GPU first so as to download all the necessary files from Hugging Face.

# run with deepspeed zero 3 + offloading
bash ./run_opt_clm.sh

# run with the current version of colossal-ai zero module
bash ./run_opt_clm_colossalai.sh

# run with the newer (experimental) version of colossal-ai zero module
bash ./run_opt_clm_colossalai_new.sh
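
If you want to pre-populate the Hugging Face cache before a multi-GPU run, a minimal Python sketch is shown below; the checkpoint name facebook/opt-13b is an assumption, so substitute the variant you plan to benchmark.

# Sketch: warm the local Hugging Face cache so parallel workers do not
# all trigger the (very large) weight download at the same time.
# "facebook/opt-13b" is an assumed checkpoint name; adjust it to match MODEL.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-13b"
AutoConfig.from_pretrained(model_name)
AutoTokenizer.from_pretrained(model_name)
AutoModelForCausalLM.from_pretrained(model_name)  # downloads the full checkpoint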

Each script takes 4 arguments (an example invocation follows the list).

  • BS: batch size per GPU
  • MEMCAP: whether to limit the GPU memory usage. For example, if MEMCAP=40, the program will only use 40 GB of memory even if the GPU has 80 GB. If MEMCAP=0, there is no limit on the available GPU memory. The default value is 0.
  • MODEL: the variant of the OPT model; the default is 13B.
  • GPUNUM: the number of GPUs to use; the default is 8.
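
For example, assuming the four arguments are passed positionally in the order listed above (check the top of each script to confirm), a single-GPU OPT-13B run with batch size 24 and no memory cap might look like this:

# hypothetical invocation: BS=24, MEMCAP=0, MODEL=13b, GPUNUM=1
bash ./run_opt_clm_colossalai.sh 24 0 13b 1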

If you encounter OOM with Colossal-AI, please tune the parameters in colossalai_zero.py. More specifically, you could try to decrease the parameters warmup_non_model_data_ratio and gpu_margin_mem_ratio.

# try to decrease warmup_non_model_data_ratio and gpu_margin_mem_ratio
# if you encounter OOM error
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.7),
            optimizer_config=dict(gpu_margin_mem_ratio=0.8, initial_scale=2**8))
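
For reference, the lower-memory setting used for the OPT-13B batch size 32 run in the results below (warmup_non_model_data_ratio=0.4, gpu_margin_mem_ratio=0.5) can be expressed with the same structure; this is a sketch, with the other fields kept as above.

# Sketch: lower-memory variant of the config above, matching the
# warmup_non_model_data_ratio=0.4 / gpu_margin_mem_ratio=0.5 setting
# reported for the OPT-13B, batch size 32 run in the results below.
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.4),
            optimizer_config=dict(gpu_margin_mem_ratio=0.5, initial_scale=2**8))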

Test Results

We ran our code on the following hardware platform:

#GPUs per Node: 8
GPU: A100 (80 GB)
CPU Memory per Node: 1900 GB
#vCPU: 110
RDMA: Yes

The following are the results for Colossal-AI vs. DeepSpeed.

| #Nodes | #GPUs | Model | System | Policy | Batch Size per GPU | Global Batch Size | Step Time | Max Allocated | Max Reserved | Throughput (samples/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 24 | 24 | 51.74 | 32.38 | 77.31 | 0.463 |
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 32 | 32 | 64.88 | 41.88 | 72.65 | 0.493 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 24 | 24 | 41.50 | 71.05 | 77.06 | 0.578 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 32 | 32 | OOM | | | |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.4, gpu_margin_mem_ratio=0.5) | 32 | 32 | 51.50 | 72.73 | 77.21 | 0.621 |
| 1 | 1 | OPT-13B | Colossal-AI | cpu | 32 | 32 | 91.68 | 45.33 | 76.38 | 0.349 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 16 | 128 | 73.95 | 32.19 | 76.38 | 1.73 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 32 | 256 | 99.86 | 59.89 | 76.59 | 2.56 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 16 | 128 | 37.48 | 63.61 | 76.22 | 3.41 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 32 | 256 | OOM | | | |
| 1 | 8 | OPT-30B | Colossal-AI | cpu | 32 | 256 | 84.78 | 65.21 | 75.63 | 3.02 |
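
As a sanity check, the reported throughput matches the global batch size divided by the step time in seconds; for example, 128 / 73.95 ≈ 1.73 samples per second for the 8-GPU OPT-30B DeepSpeed run.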

About

License: Apache License 2.0

