liucongg / ChatGLM-Finetuning

Fine-tuning ChatGLM-6B, ChatGLM2-6B, and ChatGLM3-6B on downstream tasks, covering Freeze, LoRA, P-tuning, full-parameter fine-tuning, and more.

Loss is NaN during P-tuning

silence-moon opened this issue

I ran P-tuning (PT) fine-tuning with the data under data/, but the loss stays NaN throughout training, and about halfway through I also get the error "Current loss scale already at minimum - cannot decrease scale anymore". The main problem is the NaN loss. Could someone tell me what is going on? (Note: the model I'm using is the latest chatglm3-6b currently available for download from the official site.)
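Editor's note: the "Current loss scale already at minimum - cannot decrease scale anymore" message comes from DeepSpeed's dynamic fp16 loss scaler. Whenever a step produces Inf/NaN gradients, the step is skipped and the scale is halved; once the scale has been halved down to `min_loss_scale`, the scaler can only keep skipping steps, which is consistent with a loss that is NaN from the start. The actual contents of `ds_zero3_no_offload.json` are not shown in this issue, so the following is only a minimal sketch of the fp16/bf16 options involved, using DeepSpeed's standard config keys:

```python
# Sketch only: the fp16 section that DeepSpeed's dynamic loss scaler reads.
# The real ds_zero3_no_offload.json is not shown in this issue, so these
# values are assumptions based on DeepSpeed's documented defaults.
ds_fp16_section = {
    "fp16": {
        "enabled": True,            # half-precision training uses the dynamic loss scaler
        "loss_scale": 0,            # 0 = dynamic scaling (the mode that prints the message above)
        "initial_scale_power": 16,  # start the scale at 2**16
        "loss_scale_window": 1000,  # overflow-free steps before the scale is raised again
        "hysteresis": 2,
        "min_loss_scale": 1,        # once halved down to this value, the warning above is emitted
    }
}

# On Ampere or newer GPUs, bf16 sidesteps loss scaling entirely (an alternative,
# not necessarily what the original config uses):
ds_bf16_section = {"bf16": {"enabled": True}, "fp16": {"enabled": False}}
```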
Command:
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py \
    --train_path data/spo_0.json \
    --model_name_or_path /mnt/workspace/chatglm3-6b/ \
    --per_device_train_batch_size 4 \
    --max_len 1560 \
    --max_src_len 1024 \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 4 \
    --warmup_ratio 0.1 \
    --mode glm3 \
    --train_type ptuning \
    --seed 1234 \
    --ds_file ds_zero3_no_offload.json \
    --gradient_checkpointing \
    --show_loss_step 1 \
    --pre_seq_len 16 \
    --prefix_projection True \
    --output_dir ./output-glm3

Fine-tuning output:
[INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=1, lr=[0.0001], mom=[(0.9, 0.95)]
Epoch: 0, step: 4, global_step:1, loss: nan
step: 4-1-1
...
[INFO] [timer.py:260:stop] epoch=0/micro_step=12/global_step=3, RunningAvgSamplesPerSec=0.7988681997053363, CurrSamplesPerSec=0.7988681997053363, MemAllocated=1.08GB, MaxMemAllocated=6.34GB
Epoch: 0, step: 12, global_step:3, loss: nan
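
Editor's note: since the loss is already NaN at the very first optimizer step (global_step 1), one quick way to tell whether the problem sits in the fp16/ZeRO-3 training path or in the checkpoint itself is a single forward pass outside DeepSpeed. A minimal sketch, assuming the same checkpoint path as in the command above; the prompt text is illustrative only:

```python
# Hedged sketch: one half-precision forward pass outside DeepSpeed. If the logits
# are already non-finite here, the NaN loss is not caused by the optimizer or the
# loss scaler. The checkpoint path is taken from the command above.
import torch
from transformers import AutoModel, AutoTokenizer

path = "/mnt/workspace/chatglm3-6b/"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True).half().cuda().eval()

inputs = tokenizer("你好", return_tensors="pt").to("cuda")  # illustrative prompt
with torch.no_grad():
    logits = model(**inputs).logits
print("non-finite logits:", (~torch.isfinite(logits)).any().item())
```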