horseee / LLM-Pruner

[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support LLaMA, Llama-2, BLOOM, Vicuna, Baichuan, etc.

Home Page: https://arxiv.org/abs/2305.11627

Eval Loss NaN on Llama-2

mmichaelzhang opened this issue · comments

Hi,

By any chance, have you actually tried running this on the Llama-2 model?

I tried using the default LLaMA parameters for pruning and post-training, which resulted in a similar WikiText-2 score (~19) but a much worse PTB score (~70).
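For reference, these perplexities follow the usual strided evaluation recipe; the sketch below is a generic version of it, not necessarily the exact evaluation script in this repo, and the model path is a placeholder:

```python
# Generic strided-perplexity sketch (illustrative only; swap in the pruned
# checkpoint for `model_name` when evaluating the pruned model).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
enc = tokenizer(text, return_tensors="pt")

stride, max_len, seq_len = 512, 2048, enc.input_ids.size(1)
nlls, scored = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_len, seq_len)
    trg_len = end - scored                      # tokens newly scored in this window
    input_ids = enc.input_ids[:, begin:end].to(model.device)
    labels = input_ids.clone()
    labels[:, :-trg_len] = -100                 # mask the overlapping prefix
    with torch.no_grad():
        nlls.append(model(input_ids, labels=labels).loss * trg_len)
    scored = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / scored).item())
```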

Also, when running post-training with the LLaMA parameter set, the Llama-2 loss explodes after ~0.2 epochs. I tried a smaller learning rate (1e-5), yet the eval loss still exploded to NaN.
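One way to catch the explosion early is a callback that stops the run as soon as the logged loss goes non-finite. This is only a sketch, assuming post_training.py goes through the Hugging Face `Trainer`; the callback name is hypothetical:

```python
# Hypothetical guard: stop training once the logged loss is no longer finite,
# instead of letting the eval loss reach NaN.
import math
from transformers import TrainerCallback

class StopOnNonFiniteLoss(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"Non-finite loss at step {state.global_step}; stopping training.")
            control.should_training_stop = True
        return control

# Usage (assuming a standard Trainer setup):
# trainer = Trainer(..., callbacks=[StopOnNonFiniteLoss()])
```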

It would be of great help if you could provide some insights on both pruning and post-training parameters.

Thanks.

Greetings!

Regarding the PTB results after pruning, the worse score is inherited from the base model: the unpruned Llama-2-7B already scores approximately 47 on PTB, significantly higher than LLaMA-7B (~22).

As for the NaN issue during post-training, we encountered the same problem you reported. We are currently searching for appropriate hyper-parameters to fine-tune the pruned model. If we obtain any new findings or find any bugs in our code, we will update you promptly.
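For concreteness, the knobs we are looking at are the usual stabilization settings; the values below are placeholder assumptions, not final recommendations:

```python
# Illustrative only: typical stabilization settings to sweep when fine-tuning
# the pruned model. All values are placeholders, not a verified configuration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tune_log/llama2_prune",    # hypothetical path
    learning_rate=5e-5,                    # lower than the LLaMA-1 default
    warmup_steps=100,                      # longer warmup before reaching full lr
    max_grad_norm=1.0,                     # gradient clipping
    bf16=True,                             # bfloat16 keeps float32's exponent range
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=200,
)
```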

Thank you for the timely reply! Hope to get back with good news.

Cheers.

@mmichaelzhang Have you resolved this issue? I also observed the training loss explosion and saw the performance of Llama-2-7B deteriorate when using the default LLaMA settings:

| WikiText-2 (w/o tune) | PTB (w/o tune) | BoolQ Acc | PIQA Acc | HellaSwag Acc_norm | WinoGrande Acc | ARC-e Acc | ARC-c Acc_norm | OBQA Acc_norm |
|---|---|---|---|---|---|---|---|---|
| 19.24 | 72.61 | 37.83 | 52.34 | 26.64 | 49.41 | 25.08 | 27.82 | 28.40 |

Since the pruned model weights are quantized to int8 and frozen during post-training, I think the phenomenon is unrelated to the BF16/FP16 dtype, which the authors point to as the cause:

> Tip: Training LLaMA-2 in float16 is not recommended and is known to produce nan; as such, the model should be trained in bfloat16.
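For context, the float16 failure mode that tip refers to is range overflow: float16 tops out around 6.5e4, so a large activation or loss spike becomes inf and then NaN, whereas bfloat16 keeps float32's exponent range. A quick standalone check:

```python
# Standalone illustration of the float16 vs bfloat16 range difference.
import torch

print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.4e38 (same exponent range as float32)

x = torch.tensor([70000.0])
print(x.to(torch.float16))               # overflows to inf
print(x.to(torch.bfloat16))              # stays finite (just lower precision)
print(x.to(torch.float16) - x.to(torch.float16))  # inf - inf -> nan
```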