Eval loss is 'nan'
mshumer opened this issue
All settings are standard, but the eval loss keeps coming back as NaN. Why might this be?
Thanks again @mshumer !
For future users: Here's a link to our twitter/troubleshooting conversation: https://twitter.com/mattshumer_/status/1693776553535041684
If you are fine-tuning LLaMA-2 7B, please set bf16=True and fp16=False in the HF trainer arguments. LLaMA-1 7B works fine as is.
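As a rough sketch, the change above would go in the Hugging Face TrainingArguments; the output_dir and all other settings here are placeholders, only the bf16/fp16 flags are the point:

```python
from transformers import TrainingArguments

# Placeholder settings; only the precision flags matter for this fix.
training_args = TrainingArguments(
    output_dir="./llama2-7b-finetune",  # assumed path
    bf16=True,   # bfloat16 mixed precision; its wider exponent range avoids fp16 overflow
    fp16=False,  # fp16 can overflow with LLaMA-2 7B, producing NaN eval loss
)
```

Note bf16 requires hardware that supports bfloat16 (e.g. Ampere-generation GPUs or newer); on older cards the fp16 + autocast workaround discussed below in the thread is an alternative.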
Hi @arielnlee,
I was facing that issue before in your repo. Right now I am able to run llama2-7b using your script (with some slight modifications) on 1x V100, with fp16 dtype and loading it in 8-bit:
model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")
with fp16=True,
in training args. What I have changed in your repo is: I commented out your right-padding restriction, since it may cause some overflow issues during training (I don't know the reason why; I saw this comment in an issue which I couldn't find right now). I also use tokenizer.pad_token = tokenizer.eos_token
rather than setting it to 0. Finally, I've added PyTorch's autocast context right before calling the trainer, as below:
with torch.autocast("cuda"):
trainer.train()
My training and eval losses are converging fine and look stable. I'll update the thread if I can find the link to that right-padding issue I mentioned.
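Putting the changes above together, a minimal sketch of the modified setup might look like the following. The checkpoint name is an assumption, and the dataset/trainer construction is elided since it follows the repo's own script:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name

# Load in 8-bit with fp16 compute so the 7B model fits on a single V100.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

tokenizer = LlamaTokenizer.from_pretrained(base_model)
# Pad with the EOS token instead of setting pad_token_id to 0.
tokenizer.pad_token = tokenizer.eos_token

# ... build `trainer` (datasets, TrainingArguments with fp16=True) as in the
# repo's script ...

# Wrap training in autocast so fp16-sensitive ops run in a safer dtype,
# which is what keeps the losses from overflowing to NaN here.
with torch.autocast("cuda"):
    trainer.train()
```

This mirrors the workaround described in the post above; it trades the bf16 fix (which needs newer hardware) for 8-bit loading plus autocast on an older V100.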
@emrecanacikgoz awesome! I will have to try this ASAP. Thank you for the feedback! If you notice anything else, please let me know :)