aymericdamien / TensorFlow-Examples

TensorFlow Tutorial and Examples for Beginners (support TF v1 & v2)


[Potential NAN bug] Loss may become NAN during training

Justobe opened this issue · comments

Hello~

Thank you very much for sharing the code!

I tried to use my own dataset (with the same shape as MNIST) with the code. After some iterations, the training loss became NaN. After checking the code carefully, I found that the following line may trigger the NaN loss:

In TensorFlow-Examples/examples/2_BasicModels/logistic_regression.py:

cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))

If pred contains a 0 (it is the output of softmax, which can underflow to exactly zero), tf.log(pred) yields -inf, because log(0) is undefined. This can then cause the loss to become NaN.
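The failure mode, and the effect of the clipping fix below, can be reproduced without TensorFlow. A minimal NumPy sketch (the array values here are my own illustrative example):

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])      # one-hot label
pred = np.array([0.0, 1.0, 0.0])   # softmax output where some classes underflowed to 0

with np.errstate(divide="ignore", invalid="ignore"):
    # log(0) = -inf, and 0 * -inf = nan under IEEE 754, so the sum is nan
    naive = -np.sum(y * np.log(pred))
    # clipping keeps every probability strictly positive, so the log stays finite
    clipped = -np.sum(y * np.log(np.clip(pred, 1e-10, 1.0)))

print(naive)    # nan
print(clipped)  # 0.0
```

The nan comes from the zero-probability entries where the label is also 0: the product 0 * -inf is nan, which then poisons the sum.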

It could be fixed by either of the following changes:

cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred + 1e-10), reduction_indices=1))

or

cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(tf.clip_by_value(pred,1e-10,1.0)), reduction_indices=1))
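A third option worth mentioning: TensorFlow's tf.nn.softmax_cross_entropy_with_logits computes the cross-entropy directly from the logits (before softmax) in a numerically stable way, using the log-sum-exp trick, so neither epsilon nor clipping is needed. A NumPy sketch of the idea (the function name and test values are my own; the real op is implemented in C++):

```python
import numpy as np

def stable_softmax_xent(logits, y):
    # Log-softmax via the log-sum-exp trick: shifting by the row max keeps
    # exp() from overflowing, and log() is never applied to an exact zero
    z = logits - np.max(logits, axis=1, keepdims=True)
    log_softmax = z - np.log(np.sum(np.exp(z), axis=1, keepdims=True))
    return -np.mean(np.sum(y * log_softmax, axis=1))

# Extreme logits: a plain softmax would underflow two classes to exactly 0
logits = np.array([[1000.0, 0.0, -1000.0]])
y = np.array([[0.0, 0.0, 1.0]])
print(stable_softmax_xent(logits, y))  # 2000.0, finite, no NaN
```

With these logits, computing softmax first and then taking the log produces NaN exactly as in the bug report, while the fused formulation stays finite.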

Hope to hear from you ~

Thanks in advance! : )