microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Some issues with the Twin-Flow feature provided by ZeRO-Offload++

BingxuZhu opened this issue · comments

Hello, thank you for your work on ZeRO-Offload++ and Twin-Flow. I ran ds_pretrain_gpt_2.7B.sh in Megatron-DeepSpeed with the new "ratio" parameter under "offload_optimizer", setting its value from 0.0 to 1.0. During training, the CPU virtual memory usage was identical for every ratio from 0.0 to 0.4, and identical again for every ratio from 0.5 to 1.0, so the ratio does not seem to take effect gradually. Below are the scripts and arguments I used and the resulting CPU usage.
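
Roughly, this is how I swept the ratio values (a minimal sketch; the loop just illustrates what I did by hand, editing the template before each run):

#sweep_offload_ratio.sh (illustrative helper, not part of the repo)
#!/bin/bash
TEMPLATE=ds_config_gpt_TEMPLATE.json
for RATIO in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
    # rewrite only the "ratio" field inside "offload_optimizer" in the template;
    # ds_pretrain_gpt_2.7B.sh regenerates the per-run config from this template
    sed -i "s/\"ratio\": [0-9.]*/\"ratio\": ${RATIO}/" ${TEMPLATE}
    bash ds_pretrain_gpt_2.7B.sh
done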

#ds_config_gpt_TEMPLATE.json

{
  "train_batch_size" : CONFIG_BATCH_SIZE,
  "train_micro_batch_size_per_gpu": CONFIG_MBSIZE,
  "steps_per_print": LOG_INTERVAL,

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.1
    }
  },

  "gradient_clipping": 1.0,
  "prescale_gradients":false,

  "fp16": {
    "enabled": CONFIG_FP16_ENABLED,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 11
  },

  "bf16": {
    "enabled": CONFIG_BF16_ENABLED
  },

  "wall_clock_breakdown" : false
}
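
My understanding (which may be wrong) is that ratio is supposed to control the fraction of the fp32 optimizer state (master weights plus Adam momentum and variance, about 12 bytes per parameter) that is placed in pinned CPU memory, so CPU usage should scale roughly linearly with the ratio. A back-of-envelope sketch of that expectation (the 12 bytes/parameter figure and the linear scaling are my assumptions, not from the docs):

# expected_cpu_offload.sh -- rough estimate only; assumes the ratio linearly scales
# the fraction of fp32 optimizer state (4B master copy + 4B Adam m + 4B Adam v per
# parameter) that is offloaded to CPU. Actual DeepSpeed behavior may differ.
PARAMS=2700000000      # 2.7B model
BYTES_PER_PARAM=12
for RATIO in 0.0 0.2 0.4 0.6 0.8 1.0; do
    awk -v p=${PARAMS} -v b=${BYTES_PER_PARAM} -v r=${RATIO} \
        'BEGIN { printf "ratio=%.1f -> ~%.1f GB of optimizer state on CPU\n", r, p*b*r/2^30 }'
done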

If I set the ratio parameter to 0.0, 0.1, 0.2, 0.3, or 0.4, the CPU virtual memory in the output log is about 51 GB, i.e. about 27%:

[2023-12-05 20:47:44,452] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2023-12-05 20:47:44,453] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.72 GB         Max_CA 1 GB 
[2023-12-05 20:47:44,453] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 34.79 GB, percent = 18.6%
[2023-12-05 20:47:45,070] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 2
[2023-12-05 20:47:45,071] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.63 GB         Max_CA 1 GB 
[2023-12-05 20:47:45,072] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 39.22 GB, percent = 20.9%
[2023-12-05 20:47:45,151] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2023-12-05 20:47:45,151] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.63 GB         Max_CA 1 GB 
[2023-12-05 20:47:45,152] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 34.84 GB, percent = 18.6%
[2023-12-05 20:47:45,222] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-12-05 20:47:45,223] [INFO] [utils.py:803:see_memory_usage] MA 1.88 GB         Max_MA 2.51 GB         CA 2.51 GB         Max_CA 3 GB 
[2023-12-05 20:47:45,224] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 34.84 GB, percent = 18.6%
[2023-12-05 20:47:45,595] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-05 20:47:45,596] [INFO] [utils.py:803:see_memory_usage] MA 1.88 GB         Max_MA 1.88 GB         CA 2.51 GB         Max_CA 3 GB 
[2023-12-05 20:47:45,597] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 52.16 GB, percent = 27.8%
[2023-12-05 20:47:47,492] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-12-05 20:47:47,493] [INFO] [utils.py:803:see_memory_usage] MA 5.65 GB         Max_MA 5.65 GB         CA 6.27 GB         Max_CA 6 GB 
[2023-12-05 20:47:47,493] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 51.08 GB, percent = 27.3%
[2023-12-05 20:47:47,493] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
[2023-12-05 20:47:47,988] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-12-05 20:47:47,989] [INFO] [utils.py:803:see_memory_usage] MA 6.58 GB         Max_MA 6.64 GB         CA 7.21 GB         Max_CA 7 GB 
[2023-12-05 20:47:47,989] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 51.1 GB, percent = 27.3%

Similarly, when I set the ratio parameter to 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0, the CPU virtual memory in the output log is about 89 GB, i.e. about 47%:

[2023-12-05 20:24:23,817] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2023-12-05 20:24:23,819] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.65 GB         Max_CA 1 GB 
[2023-12-05 20:24:23,819] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 44.74 GB, percent = 23.9%
[2023-12-05 20:24:24,362] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2023-12-05 20:24:24,363] [INFO] [utils.py:803:see_memory_usage] MA 0.63 GB         Max_MA 0.63 GB         CA 0.65 GB         Max_CA 1 GB 
[2023-12-05 20:24:24,364] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 63.86 GB, percent = 34.1%
[2023-12-05 20:24:27,146] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2023-12-05 20:24:27,148] [INFO] [utils.py:803:see_memory_usage] MA 0.64 GB         Max_MA 0.64 GB         CA 0.65 GB         Max_CA 1 GB 
[2023-12-05 20:24:27,148] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 81.11 GB, percent = 43.3%
[2023-12-05 20:24:27,162] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
[2023-12-05 20:24:29,339] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2023-12-05 20:24:29,340] [INFO] [utils.py:803:see_memory_usage] MA 1.57 GB         Max_MA 1.63 GB         CA 1.64 GB         Max_CA 2 GB 
[2023-12-05 20:24:29,340] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 89.22 GB, percent = 47.6%
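
To compare runs I simply grepped the see_memory_usage lines out of the training logs written by the script below, e.g.:

grep "CPU Virtual Memory" ${OUTPUT_BASEPATH}/log/${NAME}_*.log
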
#ds_pretrain_gpt_2.7B.sh

#!/bin/bash
DIR=`pwd`
SEQ_LEN=2048

MODEL_SIZE=2.7
NUM_LAYERS=32
HIDDEN_SIZE=2560
NUM_ATTN_HEADS=32
GLOBAL_BATCH_SIZE=512
LR=1.6e-4
MIN_LR=1.6e-5

TRAIN_TOKENS=300000000000

TRAIN_SAMPLES=$(( ${TRAIN_TOKENS} * 3 / ${SEQ_LEN} ))

EXIT_DURATION=30000000

WARMUP_TOKENS=375000000
LR_DECAY_TOKENS=260000000000

BATCH_SIZE=2

MP_SIZE=8

PP_SIZE=1
NUM_GPUS=8

EP_SIZE=1

# ...... default config (omitted) ......

TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${host}_${current_time}"
mkdir -p ${TENSORBOARD_DIR} 

CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"

VOCAB_PATH=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/data/gpt2-vocab.json
MERGE_PATH=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/data/gpt2-merges.txt
# Public the Pile dataset, can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
DATA_BLEND=/home/wangzhigangcs/zbx/Megatron-DeepSpeed-2348eed9ab8f851fd366f869b62f4f643eb50b41/dataset/BookCorpusDataset_text_document/BookCorpusDataset_text_document

###############################################################################
data_options=" \
         --vocab-file ${VOCAB_PATH} \
         --merge-file ${MERGE_PATH} \
         --data-path ${DATA_BLEND} \
         --data-impl mmap"
        
megatron_options=" \
        --override-opt_param-scheduler \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --tensor-model-parallel-size ${MP_SIZE} \
        --moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
        --num-experts ${EP_SIZE} \
        --moe-loss-coeff ${MLC} \
        --moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
        --moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
        --moe-min-capacity ${MOE_MIN_CAP} \
        --init-method-std ${INIT_STD} \
        --lr-decay-tokens ${LR_DECAY_TOKENS} \
        --lr-warmup-tokens ${WARMUP_TOKENS} \
        --micro-batch-size ${BATCH_SIZE} \
        --exit-duration-in-mins ${EXIT_DURATION} \
        --rampup-batch-size 32 32 1953125 \
        --global-batch-size ${GLOBAL_BATCH_SIZE} \
        --num-layers ${NUM_LAYERS} \
        --hidden-size ${HIDDEN_SIZE} \
        --num-attention-heads ${NUM_ATTN_HEADS} \
        --seq-length ${SEQ_LEN} \
        --max-position-embeddings ${SEQ_LEN} \
        --train-tokens ${TRAIN_TOKENS} \
        --train-samples ${TRAIN_SAMPLES} \
        --lr ${LR} \
        --min-lr ${MIN_LR} \
        --lr-decay-style cosine \
        --split 98,2,0 \
        --log-interval ${LOG_INTERVAL} \
        --eval-interval ${EVAL_INTERVAL} \
        --eval-iters ${EVAL_ITERS} \
        --save-interval ${SAVE_INTERVAL} \
        --weight-decay 0.1 \
        --clip-grad 1.0 \
        --hysteresis 2 \
        --num-workers 0 \
        --fp16 \
        --load ${CHECKPOINT_PATH} \
        --save ${CHECKPOINT_PATH} \
        --tensorboard-queue-size 1 \
        --log-timers-to-tensorboard \
        --timing-log-level 1 \
        --no-pipeline-parallel \
        --cpu-optimizer \
        --distributed-timeout-minutes 60 \
        --log-batch-size-to-tensorboard \
        --log-validation-ppl-to-tensorboard \
        --tensorboard-dir ${TENSORBOARD_DIR}"

if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
        --checkpoint-activations"
fi

if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
        --create-moe-param-group"
fi

if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
        --disable-moe-token-dropping"
fi

template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
    | sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
    | sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
    | sed "s/ZERO_STAGE/3/" \
    | sed "s/PRESCALE_GRAD/true/" \
    | sed "s/CONFIG_FP16_ENABLED/false/" \
    | sed "s/CONFIG_BF16_ENABLED/true/" \
    | sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
    | sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
    | sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
    | sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
	  > ${config_json}

deepspeed_options=" \
		    --deepspeed \
		    --deepspeed_config ${config_json} \
		    --pipeline-model-parallel-size ${PP_SIZE}"

# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
        --no-pipeline-parallel"
fi

if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
        --deepspeed-activation-checkpointing"
fi

run_cmd="deepspeed ${DIR}/../../pretrain_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} &> ${OUTPUT_BASEPATH}/log/${NAME}_${host}_${current_time}.log"
echo ${run_cmd}
eval ${run_cmd}
set +x

Regarding the ds_pretrain_gpt_2.7B.sh script: compared with the 350M.sh script from the ZeRO-Offload++ tutorial in the offload_pp directory, I only changed the model size and the necessary dataset configuration. I don't understand why this is happening. I am eager to use the Twin-Flow partial-offload feature; I hope you can help. Thank you.

This is my lab environment: 8x Tesla V100-SXM2-16GB, Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz, 187 GB total CPU memory.
DeepSpeed 0.12.4

microsoft/DeepSpeed#4775

Hi @BingxuZhu, I think this duplicate issue has been resolved in your post to the DeepSpeed repo (linked above). Closing it for now.