nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

How to fix unstable loss? I am using the WizardLM / Llama-X training code with the vicuna-style chat format to fine-tune the Llama-2-7b-hf model.

apt-team-018 opened this issue

I'm using the 'Llama-X' (https://github.com/AetherCortex/Llama-X) training code with the vicuna-style chat template to fine-tune the Llama-2-7b-hf model. However, I'm observing an unstable loss during the process.
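
For reference, this is roughly the prompt format I mean by "vicuna-style" (a minimal sketch assuming FastChat's vicuna_v1.1 conversation template; the exact system prompt and separators in the Llama-X preprocessing may differ):

# Minimal sketch of the vicuna-style prompt (assumed to follow FastChat's
# "vicuna_v1.1" conversation template; separators may differ in practice).
VICUNA_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_vicuna_prompt(turns):
    # turns: list of (user_message, assistant_message) pairs
    prompt = VICUNA_SYSTEM + " "
    for user_msg, assistant_msg in turns:
        # assistant turns are terminated with the </s> EOS token
        prompt += f"USER: {user_msg} ASSISTANT: {assistant_msg}</s>"
    return prompt

print(build_vicuna_prompt([("Hello!", "Hi! How can I help you today?")]))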

The detailed Weights & Biases report is here: https://wandb.ai/arpitsh018/huggingface/reports/Untitled-Report--Vmlldzo1NjE2Njgz

Training Parameters:

import os

os.system(
    # adjacent string literals are concatenated into a single shell command
    'deepspeed train.py '
    '--model_name_or_path meta-llama/Llama-2-7b-hf '
    '--data_path ../data/dummy_conversation.json '
    '--output_dir ./checkpoint/finetuned-llama2-7b '
    '--num_train_epochs 1 '
    '--model_max_length 4096 '
    '--per_device_train_batch_size 1 '
    '--per_device_eval_batch_size 1 '
    '--gradient_accumulation_steps 1 '
    '--evaluation_strategy "no" '
    '--save_strategy "steps" '
    '--save_steps 1000 '
    '--save_total_limit 3 '
    '--learning_rate 2e-5 '
    '--warmup_steps 0 '
    '--logging_steps 1 '
    '--lr_scheduler_type "cosine" '
    '--report_to "wandb" '
    '--gradient_checkpointing True '
    '--deepspeed configs/deepspeed_config.json '
    '--bf16 True'
)
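
For what it's worth, with per_device_train_batch_size 1, gradient_accumulation_steps 1, and logging_steps 1, each logged loss value comes from a single sequence per GPU, so some step-to-step noise is expected; a gradient-norm trace next to the loss would help tell that noise apart from real divergence. A hypothetical sketch of how I could log it (my own callback, assuming train.py builds a Hugging Face Trainer, as Llama-X does):

# Hypothetical diagnostic callback (not part of Llama-X): prints gradient norm and
# learning rate next to the loss, assuming train.py uses a transformers Trainer.
from transformers import TrainerCallback

class GradNormLogger(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            # "grad_norm" appears in `logs` on recent transformers versions;
            # on older versions it will simply print as None.
            print(
                f"step={state.global_step} "
                f"loss={logs.get('loss')} "
                f"grad_norm={logs.get('grad_norm')} "
                f"lr={logs.get('learning_rate')}"
            )

# usage inside train.py (assumed): trainer.add_callback(GradNormLogger())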

deepspeed_config.json:

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 0,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "wall_clock_breakdown": false
}
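
One knob the config above leaves unset is gradient clipping. As far as I know, DeepSpeed accepts a top-level "gradient_clipping" key, and with the Hugging Face integration a value of "auto" is filled in from the Trainer's --max_grad_norm. A minimal sketch of patching the config (my own helper, not part of Llama-X or DeepSpeed):

import json

with open("configs/deepspeed_config.json") as f:
    ds_config = json.load(f)

# "gradient_clipping" is a standard top-level DeepSpeed config key; "auto" lets the
# Hugging Face integration substitute the Trainer's --max_grad_norm value.
ds_config["gradient_clipping"] = "auto"

with open("configs/deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=4)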

I'd like to understand how to stabilize the loss during training. Any insights or recommendations would be greatly appreciated.