tensorflow / probability

Probabilistic reasoning and statistical analysis in TensorFlow

Home Page: https://www.tensorflow.org/probability/


Normal Inverse Gaussian NaN Gradient

i418c opened this issue

I've been playing with TFP recently and was trying out the NormalInverseGaussian distribution, but my model consistently gave me a NaN loss. I eventually tracked this down to the gradient on the backward pass becoming NaN, which then caused all of the weights to become NaN as well.

A basic gist to reproduce the problem can be found here.

Is the issue with the way I tried to implement the Keras layer or is it a more fundamental issue with the library? Any guidance on this would be appreciated.

I already have validate_args set to True in my testing. It does not trigger before the gradient becomes NaN; it only triggers on the next iteration, once the weights themselves have become NaN.
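The ordering described above (gradient goes NaN one step before validate_args can fire) can be checked directly. Below is a minimal sketch, not taken from the gist: tf.math.log stands in for the NIG log-prob as a hypothetical loss whose gradient is non-finite at w == 0, and the finiteness guard is one way to catch the corruption before the weights are updated.

```python
import tensorflow as tf

# Hypothetical stand-in for the NIG log-prob: a loss whose forward
# value is finite at w == 0, but whose gradient is not.
w = tf.Variable(0.0)

with tf.GradientTape() as tape:
    loss = tf.where(w > 0., tf.math.log(w), tf.zeros_like(w))
grad = tape.gradient(loss, w)

# validate_args only checks parameter values, so a finiteness check
# on the gradients catches the problem one step earlier, before the
# weights themselves become NaN.
grads_ok = bool(tf.reduce_all(tf.math.is_finite(grad)))
if not grads_ok:
    print("non-finite gradient detected; skipping this update")
```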

I ran the code for 1000 steps, and even increased the learning rate to 0.01 (since it seemed slow to reach the NaNs), and I never saw any. Maybe (un)lucky randomized input data? Otherwise, I'd suspect something funky with the parameters. The NIG distribution uses a Bessel function; that's a very likely culprit for any unexpected behavior.

Happy to look more closely if you can create a reproducible example (maybe fixing a seed to a more unlucky value?)

I've set the seed for TF and NumPy in this gist. It has to run for quite a few iterations, but it does error in the same manner. This seems pretty common, since it only took me two attempts to find a failing seed.

I traced the issue to a bug in NormalInverseGaussian: https://github.com/tensorflow/probability/blob/v0.23.0/tensorflow_probability/python/distributions/normal_inverse_gaussian.py#L44.

In the dataset you constructed, the input to this function happens to be 0 in one of the positions of verify (that is, (verify - loc) / scale is zero, i.e. verify == loc in that position). This puts a -Inf in the negative case of the where. This kind of tf.where is notoriously unsafe when taking gradients.
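The failure mode can be reproduced in isolation. This is a sketch using tf.math.log as a stand-in for the actual function linked above; any op whose gradient is infinite at the masked point behaves the same way.

```python
import tensorflow as tf

x = tf.constant([0.0, 1.0, 4.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    # The where masks the -inf in the forward pass at x == 0...
    y = tf.where(x > 0., tf.math.log(x), tf.zeros_like(x))

# ...but the backward pass still multiplies the (zeroed) upstream
# gradient by d log(x)/dx = 1/x, giving 0 * inf = NaN at x == 0.
grad = tape.gradient(y, x)  # [nan, 1., 0.25]
```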

The fix is to use a double-where. I'll send a fix. Can you use tfp-nightly? Otherwise it will not be out until the next stable release (these are generally ~concurrent with TF releases).
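The double-where pattern can be sketched as follows, again with tf.math.log as a hypothetical placeholder for the function in normal_inverse_gaussian.py: an inner where substitutes a safe dummy input at the masked positions, so the gradient of the unselected branch stays finite.

```python
import tensorflow as tf

x = tf.constant([0.0, 1.0, 4.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    # First where: feed a safe dummy value (1.0) into the inner op at
    # the masked positions, so it never sees the problematic input.
    safe_x = tf.where(x > 0., x, tf.ones_like(x))
    # Second where: select the real result, exactly as before. The
    # unselected branch now has a finite gradient, so no 0 * inf = NaN.
    y = tf.where(x > 0., tf.math.log(safe_x), tf.zeros_like(x))

grad = tape.gradient(y, x)  # [0., 1., 0.25] -- finite everywhere
```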

I tried testing on nightly, but I can't get past an error where a TensorCoercible isn't actually a tensor. There's a new issue where using one as an input to a Keras layer requires it to be passed as a named argument, and it's apparently not subscriptable when fed into dense layers.

Yes, I installed tf-nightly. I just created a fresh conda environment to verify it wasn't something left over from the stable TF packages, and I'm getting the same problems.

These are the packages that got installed on the fresh environment.
keras-nightly 3.0.3.dev2024010403
tb-nightly 2.16.0a20240104
tf-estimator-nightly 2.14.0.dev2023080308
tf_keras-nightly 2.16.0.dev2023123010
tf-nightly 2.16.0.dev20240106
tfp-nightly 0.24.0.dev20240106

Hi, apologies for the delay; I have the actual fix working and tested and am just awaiting code review on our internal review system. The fix should be in later today, and available in tfp-nightly tomorrow.

I'm afraid I don't understand what issue you're running into with TensorCoercible, one as input, etc. If the TF and TFP versions are aligned things /should/ work (all our automated tests are passing). Please open another issue if you're finding an inconsistency, and we can help debug there.

Thank you for your patience and for notifying us of these bugs!

Thanks for the help!
I've created a ticket for the Keras input issue and the unsubscriptable issue.