Eval loss is 'nan'
mshumer opened this issue
All settings are standard, but the eval loss keeps coming back as NaN. Why might this be?
Thanks again @mshumer !
For future users: Here's a link to our twitter/troubleshooting conversation: https://twitter.com/mattshumer_/status/1693776553535041684
If you are fine-tuning LLaMA-2 7B, please set bf16=True and fp16=False in the HF trainer arguments. LLaMA-1 7B works fine as is.
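As a rough sketch, the change above would go in the Hugging Face TrainingArguments; the output_dir and all other settings here are placeholders, only the bf16/fp16 flags are the point:

```python
from transformers import TrainingArguments

# Placeholder settings; only the precision flags matter for this fix.
training_args = TrainingArguments(
    output_dir="./llama2-7b-finetune",  # assumed path
    bf16=True,   # bfloat16 mixed precision; its wider exponent range avoids fp16 overflow
    fp16=False,  # fp16 can overflow with LLaMA-2 7B, producing NaN eval loss
)
```

Note bf16 requires hardware that supports bfloat16 (e.g. Ampere-generation GPUs or newer); on older cards the fp16 + autocast workaround discussed below in the thread is an alternative.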
Hi @arielnlee,
I was facing that issue before in your repo. Right now I am able to run llama2-7b using your script (with some slight modifications) on 1x V100, with fp16 dtype and loading it in 8-bit:
model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")
with fp16=True,
in training args. What I have changed in your repo is: I commented out your right-padding restriction, since it may cause some overflow issues during training (I don't know the reason why; I saw this comment in an issue which I couldn't find right now). I also use tokenizer.pad_token = tokenizer.eos_token
rather than setting it to 0. Finally, I've added PyTorch's autocast context right before calling the trainer, as below:
with torch.autocast("cuda"):
trainer.train()
My training and eval losses are converging fine and look stable. I'll update the thread if I can find the link to that right-padding issue I mentioned.
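Putting the changes above together, a minimal sketch of the modified setup might look like the following. The checkpoint name is an assumption, and the dataset/trainer construction is elided since it follows the repo's own script:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name

# Load in 8-bit with fp16 compute so the 7B model fits on a single V100.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

tokenizer = LlamaTokenizer.from_pretrained(base_model)
# Pad with the EOS token instead of setting pad_token_id to 0.
tokenizer.pad_token = tokenizer.eos_token

# ... build `trainer` (datasets, TrainingArguments with fp16=True) as in the
# repo's script ...

# Wrap training in autocast so fp16-sensitive ops run in a safer dtype,
# which is what keeps the losses from overflowing to NaN here.
with torch.autocast("cuda"):
    trainer.train()
```

This mirrors the workaround described in the post above; it trades the bf16 fix (which needs newer hardware) for 8-bit loading plus autocast on an older V100.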
@emrecanacikgoz awesome! I will have to try this ASAP. Thank you for the feedback! If you notice anything else, please let me know :)