NaN loss in training

Question

tranquangchung opened this issue a year ago

Hi
Thanks for sharing your project.
However, when I trained your model with your config, both the training and validation losses were NaN.
I tried many times but the results are still the same.
Can you tell me the reasons and how to solve it?

chung tq · Answer 1 · Thu May 25 2023 16:55:56 GMT+0800 (China Standard Time)

The NaN comes from the language model. I solved the problem by modifying your code, and it now works very well.

Sreyan Ghosh · Answer 2 · Sun Jul 16 2023 09:32:03 GMT+0800 (China Standard Time)

Hi @tranquangchung, how did you solve the NaN problem? Thank you!

Bingliang Li · Answer 3 · Tue Apr 16 2024 20:08:32 GMT+0800 (China Standard Time)

Hi, could you please explain how you solved this problem? Thanks!

Bingliang Li · Answer 4 · Tue Apr 16 2024 20:41:44 GMT+0800 (China Standard Time)

It turns out the problem is with google/flan-t5-large: this model does not support fp16 training. With fp32 it works fine.
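
For anyone wondering why fp16 can turn a loss into NaN, here is a minimal sketch. It is not taken from this project's code. It assumes the usual cause is overflow: the largest finite fp16 value is 65504, so a larger activation becomes inf, and an operation such as inf - inf then yields NaN. The helper name `to_fp16` and the sample value 70000.0 are hypothetical and chosen only for illustration; the sketch uses the Python standard library to emulate half precision.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value, overflowing to +/-inf."""
    try:
        # The "e" format code is IEEE 754 binary16 (half precision).
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values, so emulate the
        # overflow-to-infinity behaviour of fp16 arithmetic here.
        return math.copysign(math.inf, x)


activation = 70000.0          # hypothetical value, larger than FP16_MAX
half = to_fp16(activation)    # overflows to inf
diff = half - half            # inf - inf is NaN

print("fp16 value:", half)
print("inf - inf:", diff)
print("is NaN:", math.isnan(diff))
```

The same value is represented without trouble in fp32, which is consistent with the fix reported above. How to switch the training run to fp32 depends on the launcher and config and is not shown in this thread.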

Soujanya Poria · Answer 5 · Tue Apr 16 2024 20:43:33 GMT+0800 (China Standard Time)

Glad to know that it was solved. FYI we have released Tango 2: https://arxiv.org/abs/2404.09956