bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Grad norm increases strangely

misska1 opened this issue

When I run this https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/smaller_models/tr11f-6B3-ml-continuation.slurm script to continue training the model, I get a strange grad norm and a huge loss after the loss scale is automatically reduced on overflow. It is as if I were training from random initialization.

[Screenshot: training log, 2022-09-09 16:21]

Part of my running script:

TOKENIZER_NAME_OR_PATH=/data/pengjun/Megatron-DeepSpeed/byte-level-bpe-tokenizer-no-norm-250k-whitespace-and-eos-regex-alpha-v3-dedup-lines-articles

MASTER_ADDR=localhost
MASTER_PORT=6002

GPUS_PER_NODE=4
NNODES=1

PP_SIZE=1
TP_SIZE=1

MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=512

NLAYERS=30
NHIDDEN=4096
NHEADS=32
SEQ_LEN=2048

SAVE_INTERVAL=1

TRAIN_SAMPLES=2_200_000  # 450B tokens
LR_DECAY_SAMPLES=2_000_000  # Decay for the first 410B tokens then continue at fixed --min-lr
LR_WARMUP_SAMPLES=183_105  # 375M tokens

OPTIMIZER_ARGS=" \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-8 \
    --lr 1.2e-5 \
    --min-lr 1e-6 \
    --lr-decay-style cosine \
    --lr-decay-samples $LR_DECAY_SAMPLES \
    --lr-warmup-samples $LR_WARMUP_SAMPLES \
    --clip-grad 1.0 \
    --weight-decay 1e-1 \
    "

EXIT_OPTS=" \
    --exit-duration-in-mins 5990 \
    "

GPT_ARGS=" \
    --pp-partition-method type:transformer|embedding \
    --num-layers $NLAYERS \
    --hidden-size $NHIDDEN \
    --num-attention-heads $NHEADS \
    --seq-length $SEQ_LEN \
    --max-position-embeddings $SEQ_LEN \
    --micro-batch-size $MICRO_BATCH_SIZE \
    --rampup-batch-size 192 32 9_765_625 \
    --global-batch-size $GLOBAL_BATCH_SIZE \
    --train-samples $TRAIN_SAMPLES \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \
    --init-method-std 0.0048 \
    --override-lr-scheduler \
    --embed-layernorm \
    --fp16 \
    --seed 42 \
    --position-embedding-type alibi \
    --checkpoint-activations \
    --abort-on-unmet-fused-kernel-constraints \
    --pad-vocab-size-to 250880 \
    $OPTIMIZER_ARGS \
    $EXIT_OPTS \
    "


OUTPUT_ARGS=" \
    --log-interval 1 \
    --save-interval 10 \
    --eval-interval 1000 \
    --eval-iters 1 \
    --tensorboard-dir $TENSORBOARD_PATH \
    --tensorboard-queue-size 5 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    "

ZERO_STAGE=1 # important: bf16 must use z0! it implements its own zero stage 1 equivalent

config_json="./ds_config_1.json"

# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
cat <<EOT > $config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "train_batch_size": $GLOBAL_BATCH_SIZE,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
EOT



DEEPSPEED_ARGS=" \
    --deepspeed \
    --deepspeed_config ${config_json} \
    --zero-stage ${ZERO_STAGE} \
    --deepspeed-activation-checkpointing \
    "

export LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --max_restarts 0 \
    --tee 3 \
    "

export CMD=" \
    `pwd`/pretrain_gpt.py \
    --tensor-model-parallel-size $TP_SIZE \
    $GPT_ARGS \
    $OUTPUT_ARGS \
    --save $CHECKPOINT_PATH \
    --load /data/pengjun/Megatron-DeepSpeed/checkpoints/tr11b-7B1-ml/checkpoints/main4 \
    --train-weighted-split-paths-path $TRAIN_DATA_PATH \
    --valid-weighted-split-paths-path $VALID_DATA_PATH \
    --data-impl mmap \
    --distributed-backend nccl \
     $DEEPSPEED_ARGS \
    "
    # --load /data/pengjun/model_ckpt/bloom-7b1-optimizer-states \
#    --pipeline-model-parallel-size $PP_SIZE \

echo $CMD

# do not remove or the training will hang and nodes will be lost w/o this workaround
export CUDA_LAUNCH_BLOCKING=1

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

$LAUNCHER  $CMD 2>&1 | tee -a $LOGS_PATH/main_log.txt

echo "END TIME: $(date)"

@stas00 I'm stuck here, could you please give me some advice?
I have checked these:

  • same training args in the saved checkpoints (tokenizer, params, etc.)
  • same accuracy and consistent params across checkpoints
  • the same grad-norm increase in other models such as 1B7 and 3B
  • the same phenomenon when I change the lr_scheduler params

The initial topology conversion was written for BF16Optimizer, but here you use ZeRO stage 1, which I haven't worked with, so I have no experience with this use case.

Tagging @tjruwase, who will know the answer, as he has been working on porting the functionality to stages 1, 2, and 3. I am not sure whether it's complete or not.

@tjruwase I have tried stages 0 and 1, both with the checkpoint from https://huggingface.co/bigscience/bloom-7b1-optimizer-states and with a checkpoint transferred from https://huggingface.co/bigscience/bloom-7b1 to DeepSpeed format. All trials show the same issue.

[Screenshot: training log, 2022-09-13 19:28]
I cannot build this tokenizer from the Rust sources, so I use tokenizers==0.12.0 instead. Does this matter? @stas00

Honestly, I'm not sure, as I wasn't part of the data team. I remember they said that most likely the normal tokenizer should work, but it might be safer to use that custom version.

You can try the normal tokenizer and see if it suits your needs.

And of course we can try to see why you can't build the custom tokenizer. I will need the command and the full traceback to understand, but please open a different issue for that so it doesn't get mixed up with this one.

Hi, I have built the tokenizer successfully, but I still get the same result: the grad norm and the loss increase sharply. Could you please provide more details about the training of the 1B7, 3B, or 7B1 models? @stas00 @tjruwase

Could you please provide more details about the training of the 1B7, 3B, or 7B1 models?

I only worked on the 176B model, so I'm not the right person to ask. @TevenLeScao, @thomasw21, would you know who trained the 1B7, 3B, or 7B1 models? Do we have the slurm jobs for those somewhere under https://github.com/bigscience-workshop/bigscience/tree/master/train? If they aren't there, we probably should add them. Thank you!

I think the first grad norms that are 0 are linked to overflow: essentially we drop the batch and reduce the loss scale. So everything should be normal; typically you can check that the lm loss is decreasing.

EDIT: actually, my bad, the loss is increasing. Maybe have a look at your learning rate; perhaps it is too high?

Yes, apex will skip the step when the gradients fall outside the fp16 range and then decrease the loss scale.

I have tried decreasing the learning rate step by step, but I still get the same trend in grad norm and loss.
Actually, the loss decreases normally after iteration 5.
[Screenshot: training log, 2022-09-26 16:22] @thomasw21

A small correction: that's not apex, but DeepSpeed's top-level optimizer doing the skipping.
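
For context, here is a simplified sketch of the dynamic loss-scaling behaviour being described. It is an illustration only, not DeepSpeed's actual implementation; the default values mirror the fp16 section of the ds_config above, and hysteresis is omitted for brevity.

    # Simplified sketch of dynamic loss scaling with step skipping (illustration
    # only, not DeepSpeed's code). On overflow the optimizer step is skipped and
    # the loss scale is halved; after `scale_window` clean steps it is doubled.
    class DynamicLossScalerSketch:
        def __init__(self, init_scale=2**12, scale_window=500, min_scale=1.0):
            self.scale = init_scale           # matches initial_scale_power: 12
            self.scale_window = scale_window  # matches loss_scale_window: 500
            self.min_scale = min_scale        # matches min_loss_scale: 1
            self.good_steps = 0

        def step(self, optimizer, grads_have_inf_or_nan: bool) -> bool:
            """Return True if the update was applied, False if it was skipped."""
            if grads_have_inf_or_nan:
                # Skip the update entirely: the gradients overflowed at this scale.
                self.scale = max(self.scale / 2.0, self.min_scale)
                self.good_steps = 0
                return False
            optimizer.step()                  # gradients were already unscaled
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.scale *= 2.0             # try a larger scale again
            return True

This is why the first iterations can show a grad norm of 0: the step was skipped rather than applied.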

I'm not sure what you're missing. My honest guess is that you're not loading the optimizer state correctly; during training we had to reschedule jobs all the time because of our compute provider, so resuming from a saved state was a necessary feature. Could you please retry with https://huggingface.co/bigscience/bloom-7b1-optimizer-states?
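
For reference, a hedged sketch of pulling that checkpoint down locally; this assumes the huggingface_hub package, and the resulting directory would then be passed to --load in the slurm script.

    # Sketch only: fetch bloom-7b1-optimizer-states so that --load points at a
    # directory containing both the weights and the optimizer states.
    # The repository is large, so make sure there is enough disk space.
    from huggingface_hub import snapshot_download

    ckpt_dir = snapshot_download(repo_id="bigscience/bloom-7b1-optimizer-states")
    print(ckpt_dir)  # use this path as the value of --load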

If it still doesn't work, I don't have a good idea off the top of my head of what is wrong, and unfortunately I don't have the bandwidth to help you debug this...

@misska1 I have the same problem. Were you able to fix it?

I solved this by adding model[0].optimizer.refresh_fp32_params() right before this line:
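
The line being referred to is not quoted in this thread. As a hedged illustration only, assuming the call is placed in the checkpoint-loading path of the Megatron training setup, the idea is to resynchronize the optimizer's fp32 master copies with the just-loaded fp16 weights:

    # Hedged sketch, not the author's exact patch (the referenced line is not
    # quoted above). The intent: right after the DeepSpeed checkpoint is loaded,
    # refresh the fp32 master parameters held by the ZeRO/fp16 optimizer from
    # the fp16 model weights, so training does not resume from stale masters.
    from megatron.checkpointing import load_checkpoint  # assumed context

    iteration = load_checkpoint(model, optimizer, lr_scheduler)  # existing call
    model[0].optimizer.refresh_fp32_params()  # resync fp32 masters with loaded fp16 weights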