taolei87 / rcnn

Recurrent & convolutional neural network modules

Implemented one of your models in TensorFlow

RiaanZoetmulder opened this issue · comments

Hello!

I have implemented the dependent model from Rationalizing Neural Predictions in TensorFlow. It works for the most part: it produces some rationalizations, and its losses decrease at roughly the same rate as yours do.

However, after about 20 iterations (depending on the seed) it suddenly predicts only 0 for z, causing the MSE to become NaN and the other losses to become -inf, 0, or NaN as well. My guess is that this is either due to exploding gradients, or because predicting all zeros is the easiest way for the model to decrease the loss, since it drives the regularization terms weighted by lambda_1 and lambda_2 to 0. Since I have already tried gradient clipping, I doubt it is exploding gradients, so it is probably learning to zero out the regularizers. Did you have to deal with this during your Theano experiments?
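For reference, the clipping I tried looks roughly like this (a minimal sketch with placeholder names such as `total_loss`, not the exact code from my repository):

```python
import tensorflow as tf

def clipped_train_op(total_loss, learning_rate=1e-3, clip_norm=5.0):
    """Build a training op that clips the global gradient norm (TF 1.x API).

    total_loss stands in for the combined generator/encoder loss of the model.
    """
    optimizer = tf.train.AdamOptimizer(learning_rate)
    grads, variables = zip(*optimizer.compute_gradients(total_loss))
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(zip(clipped, variables))
```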

Kind regards,

Riaan

PS: I have the code on my GitHub account: https://github.com/RiaanZoetmulder/Master-Thesis/tree/master/step_one

Hi Riaan,

There are several possible reasons:

  1. I sometimes observed gradient explosions in Theano (in general). Most of them are caused by log-softmax or log-sigmoid operations: computing the gradient w.r.t. these can blow up numerically. This can be resolved with better coding, so that Theano can automatically optimize log-softmax / log-sigmoid into softplus (http://theano.readthedocs.io/en/0.8.x/crei2013/theano.html). I don't know whether TF applies this optimization, but it can be done manually (see the first sketch after this list).

  2. For the rationale model, I have seen the issue you observed (assuming it is not due to the reason above). Since the gradient is sampled and approximated, my understanding is that the model may experience "bad gradients" due to sampling variance.

I used a big batch size of 256 for the beer review dataset. Another, better but more complicated, solution is to use variance-reduction tricks. This is common practice in reinforcement learning, especially policy-gradient methods; you can find a lot of material and tutorials online. As an example, the follow-up paper by Jiwei (https://arxiv.org/pdf/1612.08220.pdf) uses such a trick. A sketch of the simplest version, a baseline, follows the list below.

  3. If you work on a different dataset, you need to tune the lambda regularizers. The model is quite sensitive to these values. My personal experience is to fix lambda_2 to a small value (0~1) and then find a choice of lambda_1 such that the model doesn't end up producing all 1's or all 0's for z. The last sketch below shows how the two terms enter the loss.
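For point 1, a manual way to keep the log-probabilities of the sampled z stable is to write them with softplus instead of taking the log of a sigmoid. A minimal sketch, assuming a Bernoulli sampling layer; the function and tensor names are placeholders, not from either codebase:

```python
import tensorflow as tf

def bernoulli_log_prob(logits, z):
    """Log-probability of a sampled binary mask z under Bernoulli(sigmoid(logits)),
    written with softplus so no explicit log(sigmoid(x)) appears in the graph:
        log sigmoid(x)       = -softplus(-x)
        log (1 - sigmoid(x)) = -softplus(x)
    """
    log_p1 = -tf.nn.softplus(-logits)   # log P(z_t = 1)
    log_p0 = -tf.nn.softplus(logits)    # log P(z_t = 0)
    return z * log_p1 + (1.0 - z) * log_p0
```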
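For point 2, the simplest variance-reduction trick is to subtract a baseline (for example a running mean of recent costs) from the sampled cost before multiplying with the log-probability. A rough sketch under that assumption; `cost_per_example`, `log_prob_z` and `baseline` are placeholder names:

```python
import tensorflow as tf

def generator_loss(cost_per_example, log_prob_z, baseline):
    """REINFORCE-style generator loss with a baseline subtracted to reduce variance.

    cost_per_example: encoder cost given the sampled z, shape [batch]
    log_prob_z:       summed log-probability of the sampled z, shape [batch]
    baseline:         scalar, e.g. an exponential moving average of recent costs
    """
    advantage = tf.stop_gradient(cost_per_example - baseline)  # treat as a constant reward
    return tf.reduce_mean(advantage * log_prob_z)
```

The baseline itself can be maintained in the training loop, e.g. `baseline = 0.99 * baseline + 0.01 * mean_cost` after every batch.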
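For point 3, this is roughly how I would write the two regularization terms on z, a sparsity term and a coherence term, following my reading of the rationale objective; shapes and names are assumptions:

```python
import tensorflow as tf

def rationale_regularizer(z, lambda_1, lambda_2):
    """Per-example regularizer on the binary selection z (shape [batch, time]):
    lambda_1 * (number of selected words) + lambda_2 * (number of 0/1 transitions).
    Fixing lambda_2 small and sweeping lambda_1 controls the all-0 / all-1 behaviour of z.
    """
    sparsity = tf.reduce_sum(z, axis=1)                              # ||z||_1 per example
    coherence = tf.reduce_sum(tf.abs(z[:, 1:] - z[:, :-1]), axis=1)  # sum_t |z_t - z_{t-1}|
    return lambda_1 * sparsity + lambda_2 * coherence
```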

Hope these help!

Tao

They have helped, thanks! The implementation of the dependent model can be found here:

https://github.com/RiaanZoetmulder/Master-Thesis/tree/master/rationale

I have some results on the small dataset. Curious to see what you think of them :)