bloom-dechonk

A repo for running model shrinking experiments.

References:

Known issues:

  • warmup steps and total steps in the training script (below) were chosen by guesswork and may be suboptimal,
  • the training / validation splits are not the same as in the main BLOOM training,
  • batch skipping is not properly validated; if you restart training, you may (or may not) train on some batches twice,
  • it would be better to turn env.sh into a Dockerfile that uses Ubuntu as the parent layer.

Setup

The code requires a recent release of datasets and a development version of Transformers that implements the Bloom model:

pip install https://github.com/younesbelkada/transformers/archive/ba1d9fc05fda160bda968cc77c4c5dbb21049aa9.zip
pip install datasets==2.2.2 accelerate==0.9.0
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install deepspeed==0.6.5 \
  --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

The full installation script can be found in env.sh. It assumes a clean Ubuntu/Debian installation; please look inside before you run it.
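
To sanity-check the environment after installation, you can print the resolved versions (a minimal check; the exact Transformers version string depends on the development commit above):

import accelerate
import datasets
import deepspeed
import transformers

# Expect datasets 2.2.2, accelerate 0.9.0 and deepspeed 0.6.5 as pinned above;
# transformers comes from the development commit, so it reports a dev version.
for module in (transformers, datasets, accelerate, deepspeed):
    print(module.__name__, module.__version__)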

Run experiment

First, compress the model using an arbitrary technique:

import transformers
model = transformers.BloomForCausalLM.from_pretrained("bigscience/bloom-6b3", use_auth_token=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloom-6b3", use_auth_token=True)
model = apply_your_model_compression_ideas(model, tokenizer)
model.save_pretrained("./some/folder")
tokenizer.save_pretrained("./some/folder")
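
Here, apply_your_model_compression_ideas is a placeholder for whatever shrinking you want to try. As a naive illustration only (not the repo's downsample_model.py logic; attribute names follow the current Transformers Bloom implementation), one could keep every other transformer block:

from torch import nn

def apply_your_model_compression_ideas(model, tokenizer):
    # Naive sketch: keep every other transformer block of the Bloom model.
    # The repo's actual shrinking logic lives in downsample_model.py (see below).
    kept_blocks = nn.ModuleList(list(model.transformer.h)[::2])
    model.transformer.h = kept_blocks
    model.config.n_layer = len(kept_blocks)
    return model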

Then, run the training script with the following command:

export RUN_NAME=TODO_EXP_NAME_HERE
export INPUT_PATH=. SNAPSHOT_PATH=./snapshots LOGS_PATH=./logs OMP_NUM_THREADS=32
export DATASET_NAME_OR_PATH=TODO DATASET_CONFIG_NAME=TODO INITIAL_MODEL_PATH=./some/folder

deepspeed --num_gpus 8 ./run_clm.py --do_train --do_eval \
    --model_name $INITIAL_MODEL_PATH --tokenizer_name $INITIAL_MODEL_PATH \
    --dataset_name $DATASET_NAME_OR_PATH --dataset_config_name $DATASET_CONFIG_NAME --run_name $RUN_NAME \
    --block_size 2048 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 16 \
    --learning_rate 0.00008 --max_grad_norm 1.0 --lr_scheduler_type cosine --max_steps 31250 --warmup_steps 1000 \
    --adam_epsilon 1e-8 --weight_decay 0.1 --adam_beta1 0.9 --adam_beta2 0.95 --fp16=True --seed 42 \
    --cache_dir $INPUT_PATH/data/cache --output_dir $SNAPSHOT_PATH --overwrite_output_dir=True \
    --logging_dir $LOGS_PATH --report_to tensorboard --logging_first_step --logging_steps 100 \
    --evaluation_strategy steps --eval_steps 100 --prediction_loss_only --eval_subset_size 512 \
    --save_steps 500 --save_total_limit 2 --dataloader_num_workers 8 --deepspeed ds_config.json

Note: depending on your training hardware, you may need to modify ds_config.json to enable ZeRO-3 or offloading. The default settings roughly correspond to ZeRO-2.
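
The fragment below is only a rough sketch of the relevant knobs, written as a Python dict for readability; the keys are standard DeepSpeed options, but the values are assumptions and the repo's own ds_config.json is the authoritative reference:

import json

# Illustrative ZeRO-2-style DeepSpeed config; values here are assumptions,
# not a copy of the repo's ds_config.json.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,  # set to 3 for ZeRO-3
        # "offload_optimizer": {"device": "cpu"},  # uncomment to offload optimizer state
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
print(json.dumps(ds_config, indent=2))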

The default training hyperparameters were adapted from https://huggingface.co/bigscience/tr11f-6B3-logs/tensorboard?scroll=1#text, except the learning rate and warmup steps, which were chosen based on the model's learning rate at the initial checkpoint. The command above assumes 8 GPUs; for a different setup, change gradient_accumulation_steps or per_device_train_batch_size to keep a global batch size of 512 sequences, i.e. 2^20 (~1M) tokens.
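
The batch-size arithmetic behind the defaults above:

# Global batch size implied by the default flags (8 GPUs assumed).
num_gpus = 8
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
block_size = 2048

sequences_per_step = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
tokens_per_step = sequences_per_step * block_size
print(sequences_per_step)  # 512 sequences
print(tokens_per_step)     # 1048576 tokens, i.e. 2**20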

Model shrinking code

The shrinking code requires the same development version of Transformers that implements the Bloom model:

pip install https://github.com/younesbelkada/transformers/archive/ba1d9fc05fda160bda968cc77c4c5dbb21049aa9.zip

Once you have this dependency, you should be able to shrink any Bloom model by passing the following arguments to downsample_model.py:

  • --model_name: name of the model to downsize; must be available on the Hub
  • --output_model_name: name of the output model; used to push it to the Hub or save it locally
  • --hidden_downsampling_rate: downsampling rate of the hidden dimension
  • --layer_downsampling_rate: downsampling rate of the attention blocks
  • --aggregation_strategy: aggregation strategy for the weight matrices; must be one of [first, last, mean]
  • --layer_selection_strategy: layer selection strategy for the attention layers; must be one of [first, last, step, mean]
  • --push_to_hub: flag that pushes the shrunk model to the Hub under the bigscience organization with the name output_model_name

Then run:

python downsample_model.py \
    --model_name [MODEL_NAME] --output_model_name [OUTPUT_MODEL_NAME] \
    --hidden_downsampling_rate [HIDDEN_DOWNSAMPLING_RATE] --layer_downsampling_rate [LAYER_DOWNSAMPLING_RATE] \
    --aggregation_strategy [AGGREGATION_STRATEGY] --layer_selection_strategy [LAYER_SELECTION_STRATEGY] \
    [--push_to_hub]
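
For intuition about what the strategies mean, the sketch below is an illustration only (hypothetical helper functions, not the code in downsample_model.py): mean aggregation averages adjacent groups of rows in a weight matrix, and step selection keeps every (1/rate)-th layer.

import torch

def mean_downsample_rows(weight: torch.Tensor, rate: float = 0.5) -> torch.Tensor:
    # Illustrative "mean" aggregation: average adjacent groups of rows so that
    # the first dimension shrinks by the given rate (0.5 -> pairs of rows).
    group = int(round(1 / rate))
    out_rows = weight.shape[0] // group
    return weight[: out_rows * group].reshape(out_rows, group, *weight.shape[1:]).mean(dim=1)

def step_layer_selection(num_layers: int, rate: float = 0.5) -> list:
    # Illustrative "step" selection: keep every (1/rate)-th layer index.
    step = int(round(1 / rate))
    return list(range(0, num_layers, step))

weight = torch.randn(8, 16)
print(mean_downsample_rows(weight).shape)  # torch.Size([4, 16])
print(step_layer_selection(30))            # [0, 2, 4, ..., 28]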
