Some issues with the Twin-Flow feature provided by ZeRO-Offload++
BingxuZhu opened this issue
Hello, and thank you for your contribution of ZeRO-Offload++. I ran `ds_pretrain_gpt_2.7B.sh` in Megatron-DeepSpeed with the new `offload_optimizer` parameter `ratio`, sweeping its value from 0.0 to 1.0. During training, CPU virtual memory usage was identical for every ratio from 0.0 to 0.4, and likewise identical for every ratio from 0.5 to 1.0. Below are the scripts and arguments I used and the resulting CPU usage.
#ds_config_gpt_TEMPLATE.json

```json
{
  "train_batch_size": CONFIG_BATCH_SIZE,
  "train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
  "steps_per_print": LOG_INTERVAL,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.1
    }
  },
  "gradient_clipping": 1.0,
  "prescale_gradients": false,
  "fp16": {
    "enabled": CONFIG_FP16_ENABLED,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 11
  },
  "bf16": {
    "enabled": CONFIG_BF16_ENABLED
  },
  "wall_clock_breakdown": false
}
```
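For each run I regenerated this config with a different `ratio`. A minimal sketch of such a sweep, reusing the same kind of `sed` substitution the launch script below uses (illustrative only, not the exact commands I ran; the output filenames are hypothetical):

```bash
# Hypothetical sweep: emit one config per ratio value by replacing the
# template's hardcoded "ratio": 0.1.
for r in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
  sed "s/\"ratio\": 0.1/\"ratio\": ${r}/" ds_config_gpt_TEMPLATE.json \
    > ds_config_gpt_ratio_${r}.json
done
```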
If I set the ratio parameter to 0.0, 0.1, 0.2, 0.3, or 0.4, the CPU virtual memory in the output log is about 51 GB (about 27%):
```
[2023-12-05 20:47:44,452] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2023-12-05 20:47:44,453] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB Max_MA 0.63 GB CA 0.72 GB Max_CA 1 GB
[2023-12-05 20:47:44,453] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 34.79 GB, percent = 18.6%
[2023-12-05 20:47:45,070] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 2
[2023-12-05 20:47:45,071] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB Max_MA 0.63 GB CA 0.63 GB Max_CA 1 GB
[2023-12-05 20:47:45,072] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 39.22 GB, percent = 20.9%
[2023-12-05 20:47:45,151] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2023-12-05 20:47:45,151] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB Max_MA 0.63 GB CA 0.63 GB Max_CA 1 GB
[2023-12-05 20:47:45,152] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 34.84 GB, percent = 18.6%
[2023-12-05 20:47:45,222] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-12-05 20:47:45,223] [INFO] [utils.py:803:see_memory_usage] MA 1.88 GB Max_MA 2.51 GB CA 2.51 GB Max_CA 3 GB
[2023-12-05 20:47:45,224] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 34.84 GB, percent = 18.6%
[2023-12-05 20:47:45,595] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-05 20:47:45,596] [INFO] [utils.py:803:see_memory_usage] MA 1.88 GB Max_MA 1.88 GB CA 2.51 GB Max_CA 3 GB
[2023-12-05 20:47:45,597] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 52.16 GB, percent = 27.8%
[2023-12-05 20:47:47,492] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-12-05 20:47:47,493] [INFO] [utils.py:803:see_memory_usage] MA 5.65 GB Max_MA 5.65 GB CA 6.27 GB Max_CA 6 GB
[2023-12-05 20:47:47,493] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 51.08 GB, percent = 27.3%
[2023-12-05 20:47:47,493] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
[2023-12-05 20:47:47,988] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-12-05 20:47:47,989] [INFO] [utils.py:803:see_memory_usage] MA 6.58 GB Max_MA 6.64 GB CA 7.21 GB Max_CA 7 GB
[2023-12-05 20:47:47,989] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 51.1 GB, percent = 27.3%
```
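If the ratio were applied proportionally, each 0.1 step should shift roughly 3 GB of optimizer state on or off the CPU for a 2.7B-parameter model, rather than producing two plateaus. A rough estimate, assuming ~12 bytes per parameter of fp32 Adam state (master weights + momentum + variance) and ignoring pinned fp16 buffers and allocator overhead:

```bash
# Back-of-envelope: expected CPU-side optimizer state if "ratio" scaled linearly.
PARAMS=2700000000   # 2.7B parameters
BYTES_PER_PARAM=12  # fp32 master weights + momentum + variance (assumption)
for r in 0.0 0.2 0.4 0.6 0.8 1.0; do
  awk -v p="$PARAMS" -v b="$BYTES_PER_PARAM" -v r="$r" \
    'BEGIN { printf "ratio %.1f -> ~%.1f GB on CPU\n", r, p * b * r / (1024 ^ 3) }'
done
```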
Similarly, when I set the ratio parameter to 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0, the CPU virtual memory in the output log is about 89 GB (about 47%):
```
[2023-12-05 20:24:23,817] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-12-05 20:24:23,819] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB Max_MA 0.63 GB CA 0.65 GB Max_CA 1 GB
[2023-12-05 20:24:23,819] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 44.74 GB, percent = 23.9%
[2023-12-05 20:24:24,362] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-05 20:24:24,363] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB Max_MA 0.63 GB CA 0.65 GB Max_CA 1 GB
[2023-12-05 20:24:24,364] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 63.86 GB, percent = 34.1%
[2023-12-05 20:24:27,146] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-12-05 20:24:27,148] [INFO] [utils.py:803:see_memory_usage] MA 0.64 GB Max_MA 0.64 GB CA 0.65 GB Max_CA 1 GB
[2023-12-05 20:24:27,148] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 81.11 GB, percent = 43.3%
[2023-12-05 20:24:27,162] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
[2023-12-05 20:24:29,339] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-12-05 20:24:29,340] [INFO] [utils.py:803:see_memory_usage] MA 1.57 GB Max_MA 1.63 GB CA 1.64 GB Max_CA 2 GB
[2023-12-05 20:24:29,340] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 89.22 GB, percent = 47.6%
```
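To compare the runs side by side, the relevant lines can be pulled out of each run's log; `see_memory_usage` prints the CPU line two lines after its tag line. (A convenience snippet; `${OUTPUT_BASEPATH}` is the same variable the launch script below writes logs under.)

```bash
# Extract the post-initialization CPU memory line from every run's log.
grep -A2 "After initializing optimizer states" ${OUTPUT_BASEPATH}/log/*.log \
  | grep "CPU Virtual Memory"
```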
#ds_pretrain_gpt_2.7B.sh

```bash
#!/bin/bash
DIR=`pwd`
SEQ_LEN=2048
MODEL_SIZE=2.7
NUM_LAYERS=32
HIDDEN_SIZE=2560
NUM_ATTN_HEADS=32
GLOBAL_BATCH_SIZE=512
LR=1.6e-4
MIN_LR=1.6e-5
TRAIN_TOKENS=300000000000
TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))
EXIT_DURATION=30000000
WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000
BATCH_SIZE=2
MP_SIZE=8
PP_SIZE=1
NUM_GPUS=8
EP_SIZE=1
# ......... default config (elided) .........
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
VOCAB_PATH=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/data/gpt2-vocab.json
MERGE_PATH=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/data/gpt2-merges.txt
# The Pile dataset is public; it can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/BookCorpusDataset_text_document/BookCorpusDataset_text_document
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-opt_param-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--rampup-batch-size 32 32 1953125 \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-samples ${TRAIN_SAMPLES} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--fp16 \
--load ${CHECKPOINT_PATH} \
--save ${CHECKPOINT_PATH} \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--timing-log-level 1 \
--no-pipeline-parallel \
--cpu-optimizer \
--distributed-timeout-minutes 60 \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/3/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/false/" \
| sed "s/CONFIG_BF16_ENABLED/true/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
  deepspeed_options="${deepspeed_options} \
    --no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
  deepspeed_options="${deepspeed_options} \
    --deepspeed-activation-checkpointing"
fi
run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x
```
Regarding `ds_pretrain_gpt_2.7B.sh`: compared with the `350M.sh` script from the ZeRO-Offload++ tutorials in the `offload_pp` directory, I only changed the model size and some necessary dataset configuration. I don't understand why this happens. I am eager to use the Twin-Flow partial-offload feature and hope you can help. Thank you!
This is my lab environment: 8× Tesla V100-SXM2-16GB, Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz, 187 GB total CPU memory, DeepSpeed 0.12.4.
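For completeness, the installed DeepSpeed version and build can be confirmed from the same environment (generic checks, nothing Twin-Flow-specific):

```bash
python -c "import deepspeed; print(deepspeed.__version__)"  # prints 0.12.4 here
ds_report  # DeepSpeed's environment and ops report
```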
Hi @BingxuZhu, I think this duplicate issue was resolved in your post to the DeepSpeed repo (the issue referenced above). Closing it for now.