tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow

Home Page: https://www.tensorflow.org/probability/


Normal Inverse Gaussian NaN Gradient

i418c opened this issue

I've been playing with TFP recently and was trying out the NormalInverseGaussian distribution, but my model consistently gave me a NaN loss. I eventually tracked this down to the gradient on the backward pass becoming NaN, which then caused all of the weights to become NaN as well.

A basic gist to reproduce the problem can be found here.

Is the issue with the way I tried to implement the Keras layer or is it a more fundamental issue with the library? Any guidance on this would be appreciated.

I already have validate_args set to True in my testing. It does not trigger before the gradient becomes NaN; it only triggers on the next iteration, once the weights themselves have become NaN.
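The ordering described above (gradient goes NaN one step before validate_args can fire) can be checked directly. Below is a minimal sketch, not taken from the gist: tf.math.log stands in for the NIG log-prob as a hypothetical loss whose gradient is non-finite at w == 0, and the finiteness guard is one way to catch the corruption before the weights are updated.

```python
import tensorflow as tf

# Hypothetical stand-in for the NIG log-prob: a loss whose forward
# value is finite at w == 0, but whose gradient is not.
w = tf.Variable(0.0)

with tf.GradientTape() as tape:
    loss = tf.where(w > 0., tf.math.log(w), tf.zeros_like(w))
grad = tape.gradient(loss, w)

# validate_args only checks parameter values, so a finiteness check
# on the gradients catches the problem one step earlier, before the
# weights themselves become NaN.
grads_ok = bool(tf.reduce_all(tf.math.is_finite(grad)))
if not grads_ok:
    print("non-finite gradient detected; skipping this update")
```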

I ran the code for 1000 steps, and even increased the learning rate to 0.01 (since it seemed slow to reach the NaNs), and I never saw any. Maybe (un)lucky randomized input data? Otherwise, I'd suspect something funky with the parameters. The NIG distribution uses a Bessel function; that's a very likely culprit for any unexpected behavior.

Happy to look more closely if you can create a reproducible example (maybe fixing a seed to a more unlucky value?)

I've set the seed for TF and NumPy in this gist. It has to run for quite a few iterations, but it does error in the same manner. This seems pretty common, since it only took me two attempts to find a failing seed.

I traced the issue to a bug in NormalInverseGaussian: https://github.com/tensorflow/probability/blob/v0.23.0/tensorflow_probability/python/distributions/normal_inverse_gaussian.py#L44.

In the dataset you constructed, the input to this function happens to be 0 in one of the positions of verify (that is, (verify - loc) / scale is zero, i.e. verify == loc in that position). This puts a -Inf in the negative case of the where. This kind of tf.where is notoriously unsafe when taking gradients.
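The failure mode can be reproduced in isolation. This is a sketch using tf.math.log as a stand-in for the actual function linked above; any op whose gradient is infinite at the masked point behaves the same way.

```python
import tensorflow as tf

x = tf.constant([0.0, 1.0, 4.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    # The where masks the -inf in the forward pass at x == 0...
    y = tf.where(x > 0., tf.math.log(x), tf.zeros_like(x))

# ...but the backward pass still multiplies the (zeroed) upstream
# gradient by d log(x)/dx = 1/x, giving 0 * inf = NaN at x == 0.
grad = tape.gradient(y, x)  # [nan, 1., 0.25]
```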

The fix is to use a double-where. I'll send a fix. Can you use tfp-nightly? Otherwise it will not be out until the next stable release (these are generally ~concurrent with TF releases).
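The double-where pattern can be sketched as follows, again with tf.math.log as a hypothetical placeholder for the function in normal_inverse_gaussian.py: an inner where substitutes a safe dummy input at the masked positions, so the gradient of the unselected branch stays finite.

```python
import tensorflow as tf

x = tf.constant([0.0, 1.0, 4.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    # First where: feed a safe dummy value (1.0) into the inner op at
    # the masked positions, so it never sees the problematic input.
    safe_x = tf.where(x > 0., x, tf.ones_like(x))
    # Second where: select the real result, exactly as before. The
    # unselected branch now has a finite gradient, so no 0 * inf = NaN.
    y = tf.where(x > 0., tf.math.log(safe_x), tf.zeros_like(x))

grad = tape.gradient(y, x)  # [0., 1., 0.25] -- finite everywhere
```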

I tried testing on nightly, but I can't get past an error where a TensorCoercible isn't actually a tensor. There's a new issue where using one as an input to a Keras layer requires it to be passed as a named argument, and it's apparently not subscriptable when fed into dense layers.

Yes, I installed tf-nightly. I just created a fresh conda environment to verify it wasn't something left over from the stable TF packages, and I'm getting the same problems.

These are the packages that got installed on the fresh environment.
keras-nightly 3.0.3.dev2024010403
tb-nightly 2.16.0a20240104
tf-estimator-nightly 2.14.0.dev2023080308
tf_keras-nightly 2.16.0.dev2023123010
tf-nightly 2.16.0.dev20240106
tfp-nightly 0.24.0.dev20240106

Hi, apologies for the delay; I have the actual fix working and tested and am just awaiting code review on our internal review system. The fix should be in later today, and available in tfp-nightly tomorrow.

I'm afraid I don't understand what issue you're running into with TensorCoercible, one as input, etc. If the TF and TFP versions are aligned things /should/ work (all our automated tests are passing). Please open another issue if you're finding an inconsistency, and we can help debug there.

Thank you for your patience and for notifying us of these bugs!

Thanks for the help!
I've created a ticket for the Keras input issue and the unsubscriptable issue.