The embedding values become NaN during training
justicevita opened this issue
Same as the title.
I think maybe you could use:
self.z_mean = tf.where(tf.abs(self.z_mean) < 1e-10, x=tf.zeros_like(self.z_mean, dtype=self.z_mean.dtype), y=self.z_mean)
to avoid the problem?
Dear @justicevita,
Thank you for your message. I saw that you closed this issue but I would still like to comment, because your question is very relevant and other users might face the same problem in the future.
We are aware of this. In the (non-variational) gravity graph AE model, some NaN values might indeed occur during training for some graphs. This happens when two embedding vectors z_i and z_j become too close or identical, which leads to numerical issues when computing - log( ||z_i - z_j||_2^2 ) in the decoder; these NaN values then propagate through the model during training.
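To illustrate the failure mode, here is a minimal sketch in plain NumPy (not the repo's code) of what happens when two embeddings collapse to the same point:

```python
import numpy as np

z_i = np.array([0.5, -0.2])
z_j = np.array([0.5, -0.2])         # two embeddings that collapsed to the same point
sq_dist = np.sum((z_i - z_j) ** 2)  # ||z_i - z_j||_2^2 == 0.0
print(-np.log(sq_dist))             # -log(0) = inf; its gradient becomes NaN during training
```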
To avoid such instability, we simply added a float parameter epsilon (with a default value of 0.01) and chose to compute - log( ||z_i - z_j||_2^2 + epsilon) in our code. Increasing the value of epsilon should remove your NaN problem. That's what we did for "Google - Task 2" in our experiments; please see the corresponding section of the README.
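For concreteness, here is a minimal sketch of the stabilized term in TF 1.x style (as used in this repository); the function name and the vectorized pairwise-distance computation are illustrative, not the exact code from the repo:

```python
import tensorflow as tf  # TF 1.x style, as in this repository

def stabilized_log_distances(z, epsilon=0.01):
    """- log( ||z_i - z_j||_2^2 + epsilon ) for all pairs of embedding rows."""
    # squared Euclidean distances between all pairs:
    # ||z_i||^2 - 2 * z_i . z_j + ||z_j||^2
    sq_norms = tf.reduce_sum(tf.square(z), axis=1, keepdims=True)
    sq_dist = sq_norms - 2.0 * tf.matmul(z, z, transpose_b=True) + tf.transpose(sq_norms)
    # epsilon keeps the argument of the log strictly positive,
    # even when two embeddings coincide (sq_dist == 0)
    return -tf.log(sq_dist + epsilon)
```

Increasing epsilon trades a small bias in the decoder term for numerical stability.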
If time permits, we might consider working on a more elegant way to tackle this problem in the future.
Best,
Guillaume
P.S.: note that you are very unlikely to face this problem in the gravity graph VAE model, thanks to the z_i sampling step.