Lightning-AI / lit-llama

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. Apache 2.0-licensed.

[question] nan loss value and run time error

nevermet opened this issue · comments

Dear all,

I am fine-tuning on my own data with OpenLLaMA. While running finetune/lora.py, the loss becomes NaN:
...
iter 3198: loss nan, time: 134.94ms

and during validation, it fails with this error:
...
File ".../lit-llama/generate.py", line 74, in generate
idx_next = torch.multinomial(probs, num_samples=1).to(dtype=dtype)
RuntimeError: probability tensor contains either inf, nan or element < 0
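For context on how these two symptoms are connected: once the training loss is NaN, the model's logits are NaN as well, and softmax turns a single NaN logit into an all-NaN probability vector, which is exactly what `torch.multinomial` rejects. A minimal pure-Python sketch (illustrative only, not lit-llama code) of the propagation and a guard one could add before sampling:

```python
import math

def softmax(logits):
    # A single NaN logit poisons every probability:
    # exp(NaN) is NaN, the sum is NaN, and NaN/NaN is NaN.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def check_probs(probs):
    # Mirrors the condition torch.multinomial complains about:
    # "probability tensor contains either inf, nan or element < 0"
    if any(math.isnan(p) or math.isinf(p) or p < 0 for p in probs):
        raise ValueError("probability tensor contains either inf, nan or element < 0")

probs = softmax([2.0, float("nan"), 0.5])
print(all(math.isnan(p) for p in probs))  # every entry is NaN
```

Such a guard only makes the failure explicit earlier; the underlying fix is to stop the loss from diverging in the first place (common culprits are a too-high learning rate, malformed training samples, or fp16 overflow).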

Could you tell me how I can resolve this?

Thanks in advance.