CodeLlama-34b loss=0
cdj0311 opened this issue · comments
Hi,
I fine-tuned CodeLlama-34b-Python-hf with train_wizardcoder.py, but the loss drops to 0 after about a hundred steps. The 7b and 13b models did not have this problem.
Environment
PyTorch == 2.0.1
Transformers == 4.31.0
Deepspeed == 0.9.3
Script
BASE_MODEL=./CodeLlama-34b-Python-hf
OUTPUT_MODEL=./CodeLlama-34b-Python-Evol
torchrun --nproc_per_node 8 \
--nnodes 4 \
--node_rank 0 \
--master_addr "localhost" \
--master_port 6000 \
train_wizard.py \
--model_name_or_path $BASE_MODEL \
--data_path "/path/code-evol-instruct.json" \
--output_dir $OUTPUT_MODEL \
--num_train_epochs 2 \
--model_max_length 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--warmup_steps 0 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--report_to "tensorboard" \
--gradient_checkpointing True \
--deepspeed configs/deepspeed_config.json \
--bf16 True
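Since the collapse happens mid-run, it can save GPU hours to stop as soon as the loss flatlines at 0 instead of spotting it in TensorBoard later. A minimal sketch (hypothetical helper, not part of the repo) of such a check, which could be wired into a `transformers` `TrainerCallback`:

```python
# Hypothetical debugging aid: detect when the logged loss collapses to
# exactly 0 for several consecutive steps, so training can be halted and
# the checkpoint inspected.

def loss_collapsed(recent_losses, window=5):
    """Return True if the last `window` logged losses are all exactly 0."""
    tail = recent_losses[-window:]
    return len(tail) == window and all(l == 0.0 for l in tail)

# With transformers, this could run inside a TrainerCallback's on_log hook,
# reading logs.get("loss") and setting control.should_training_stop = True.
if __name__ == "__main__":
    history = [1.2, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(loss_collapsed(history))  # True: the last five losses are all 0
```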
DeepSpeed config:
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 0,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 0,
"stage3_max_reuse_distance": 0,
"stage3_gather_16bit_weights_on_model_save": true
},
"bf16": {
"enabled": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": [
0.9,
0.999
],
"eps": 1e-8,
"weight_decay": 0
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"wall_clock_breakdown": false
}
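One thing worth double-checking in the config itself: `"stage": 2` is combined with several `stage3_*` keys and with `offload_param`, which DeepSpeed only honors under ZeRO stage 3, so those settings are silently ignored here. If parameter offload is actually intended, a consistent ZeRO-3 sketch of that block (same values, with the stage changed and the zeroed `stage3_*` limits left at their defaults) might look like:

```json
"zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
}
```

This is only a sketch of a self-consistent config, not a confirmed fix for the loss=0 issue.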
I have solved it.
I got the same error. How did you solve it, bro?