greatlog / DAN

This is an official implementation of Unfolding the Alternating Optimization for Blind Super Resolution

Problem of 'NaN' loss value during training

YuqiangY opened this issue

Thank you very much for your work.
As the title says, the loss value suddenly becomes NaN at iteration 22800 (and a second time at 29200).
Have you ever encountered this kind of error?

Yes, this situation occurs sometimes. A workaround is to resume training from the last normal checkpoint.
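The workaround above can be sketched as a guard in the training loop: checkpoint periodically, and when the loss turns NaN, roll back to the last normal state. This is a minimal pure-Python sketch, not DAN's actual training code; `step_fn` is a hypothetical stand-in for one optimizer step. Note that if `step_fn` is fully deterministic the same NaN would recur, so in practice the rollback relies on batch shuffling producing different data the second time.

```python
import copy
import math

def train_with_nan_guard(params, step_fn, num_iters, ckpt_every=100):
    """Toy training loop that resumes from the last normal state on NaN loss.

    step_fn(params, i) performs one (hypothetical) training step and
    returns (new_params, loss).
    """
    # The initial state is the first "last normal" checkpoint.
    ckpt_params, ckpt_iter = copy.deepcopy(params), 0
    i = 0
    rollbacks = 0
    while i < num_iters:
        params, loss = step_fn(params, i)
        if math.isnan(loss):
            # Loss diverged: restore the last normal training state and retry.
            params = copy.deepcopy(ckpt_params)
            i = ckpt_iter
            rollbacks += 1
            continue
        i += 1
        if i % ckpt_every == 0:
            # Save a known-good checkpoint.
            ckpt_params, ckpt_iter = copy.deepcopy(params), i
    return params, rollbacks
```

In a real PyTorch run the checkpoint would be written with `torch.save` (model, optimizer, and scheduler state) rather than kept in memory.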

Thanks, that is how I worked around the problem.
However, do you have any clues about what causes this error?

Sorry, I have not figured it out yet. If you have any ideas, please tell me. Thank you.

I couldn't find the key to the problem either. In the last 20 hours this error has occurred several times, especially once the number of iterations exceeds 115000. Is that common?

In my case, it occurred twice during 400000 iterations. The frequency seems random. It may be an inherent drawback of the proposed method, since DAN is effectively a recurrent neural network (RNN). Perhaps some ideas from RNN training could be borrowed to stabilize the training.
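One standard idea from the RNN literature is gradient clipping by global norm, which counters exploding gradients (in PyTorch this is `torch.nn.utils.clip_grad_norm_`). A minimal pure-Python sketch of the scaling rule, with gradients represented as a flat list of floats for illustration:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm.

    If the norm is already within the limit, the gradients pass through
    unchanged; otherwise every component is scaled by the same factor,
    preserving the gradient direction.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```

Whether clipping actually prevents the NaN here is an open question (the divergence could also come from the data or from numerical issues in the kernel-estimation branch), but it is a cheap experiment to try.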

OK, thanks for your reply.