CodeLlama-34b loss=0
cdj0311 opened this issue · comments
Hi,
I fine-tuned CodeLlama-34b-Python-hf with train_wizardcoder.py, but the loss drops to 0 after about a hundred steps. The 7b and 13b models did not have this problem.
Environment
PyTorch == 2.0.1
Transformers == 4.31.0
Deepspeed == 0.9.3
Script
BASE_MODEL=./CodeLlama-34b-Python-hf
OUTPUT_MODEL=./CodeLlama-34b-Python-Evol
torchrun --nproc_per_node 8 \
--nnodes 4 \
--node_rank 0 \
--master_addr "localhost" \
--master_port 6000 \
train_wizard.py \
--model_name_or_path $BASE_MODEL \
--data_path "/path/code-evol-instruct.json" \
--output_dir $OUTPUT_MODEL \
--num_train_epochs 2 \
--model_max_length 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--warmup_steps 0 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--report_to "tensorboard" \
--gradient_checkpointing True \
--deepspeed configs/deepspeed_config.json \
--bf16 True
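Since the collapse happens mid-run, it can save GPU hours to stop as soon as the loss flatlines at 0 instead of spotting it in TensorBoard later. A minimal sketch (hypothetical helper, not part of the repo) of such a check, which could be wired into a `transformers` `TrainerCallback`:

```python
# Hypothetical debugging aid: detect when the logged loss collapses to
# exactly 0 for several consecutive steps, so training can be halted and
# the checkpoint inspected.

def loss_collapsed(recent_losses, window=5):
    """Return True if the last `window` logged losses are all exactly 0."""
    tail = recent_losses[-window:]
    return len(tail) == window and all(l == 0.0 for l in tail)

# With transformers, this could run inside a TrainerCallback's on_log hook,
# reading logs.get("loss") and setting control.should_training_stop = True.
if __name__ == "__main__":
    history = [1.2, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(loss_collapsed(history))  # True: the last five losses are all 0
```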
DeepSpeed config:
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 0,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 0,
"stage3_max_reuse_distance": 0,
"stage3_gather_16bit_weights_on_model_save": true
},
"bf16": {
"enabled": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": [
0.9,
0.999
],
"eps": 1e-8,
"weight_decay": 0
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"wall_clock_breakdown": false
}
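One thing worth double-checking in the config itself: `"stage": 2` is combined with several `stage3_*` keys and with `offload_param`, which DeepSpeed only honors under ZeRO stage 3, so those settings are silently ignored here. If parameter offload is actually intended, a consistent ZeRO-3 sketch of that block (same values, with the stage changed and the zeroed `stage3_*` limits left at their defaults) might look like:

```json
"zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
}
```

This is only a sketch of a self-consistent config, not a confirmed fix for the loss=0 issue.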
I have solved it.
I got the same error. How did you solve it, bro?