The embedding values become NaN during training
justicevita opened this issue
Same as the title.
I think maybe you could use:
self.z_mean = tf.where(tf.abs(self.z_mean) < 1e-10, x=tf.zeros_like(self.z_mean, dtype=self.z_mean.dtype), y=self.z_mean)
to avoid the problem?
Dear @justicevita,
Thank you for your message. I saw that you closed this issue but I would still like to comment, because your question is very relevant and other users might face the same problem in the future.
We are aware of this. In the (non-variational) gravity graph AE model, some NaN values might indeed occur during training for some graphs. This happens when two embedding vectors z_i and z_j become too close or identical, which leads to numerical issues when computing - log( ||z_i - z_j||_2^2 ) in the decoder; these NaN values then propagate through the model during training.
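To illustrate the failure mode, here is a minimal sketch in plain NumPy (not the repo's code) of what happens when two embeddings collapse to the same point:

```python
import numpy as np

z_i = np.array([0.5, -0.2])
z_j = np.array([0.5, -0.2])         # two embeddings that collapsed to the same point
sq_dist = np.sum((z_i - z_j) ** 2)  # ||z_i - z_j||_2^2 == 0.0
print(-np.log(sq_dist))             # -log(0) = inf; its gradient becomes NaN during training
```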
To avoid such instability, we simply added a float parameter epsilon (with a default value of 0.01) and chose to compute - log( ||z_i - z_j||_2^2 + epsilon) in our code. Increasing the value of epsilon should remove your NaN problem. That's what we did for "Google - Task 2" in our experiments; please see the corresponding section of the README.
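For concreteness, here is a minimal sketch of the stabilized term in TF 1.x style (as used in this repository); the function name and the vectorized pairwise-distance computation are illustrative, not the exact code from the repo:

```python
import tensorflow as tf  # TF 1.x style, as in this repository

def stabilized_log_distances(z, epsilon=0.01):
    """- log( ||z_i - z_j||_2^2 + epsilon ) for all pairs of embedding rows."""
    # squared Euclidean distances between all pairs:
    # ||z_i||^2 - 2 * z_i . z_j + ||z_j||^2
    sq_norms = tf.reduce_sum(tf.square(z), axis=1, keepdims=True)
    sq_dist = sq_norms - 2.0 * tf.matmul(z, z, transpose_b=True) + tf.transpose(sq_norms)
    # epsilon keeps the argument of the log strictly positive,
    # even when two embeddings coincide (sq_dist == 0)
    return -tf.log(sq_dist + epsilon)
```

Increasing epsilon trades a small bias in the decoder term for numerical stability.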
If time permits, we might consider working on a more elegant way to tackle this problem in the future.
Best,
Guillaume
P.S.: note that you are very unlikely to face this problem in the gravity graph VAE model, thanks to the z_i sampling step.