ketranm / neuralHMM

Code for the paper "Unsupervised Neural Hidden Markov Models"

Gradient rescaling using Baum-Welch statistics & Pytorch implementation?

djelkind opened this issue · comments

commented

Hi,

I'm really excited about this project. The corresponding paper addresses exactly the goals I have for a specific model, so I was glad to find both the paper and this repository.

I've been reviewing the code in preparation to write a pytorch implementation. But since I don't want to redo work that someone's already done, I'd like to ask if you're aware of an existing pytorch implementation before I get too deep into writing code.

And if you're not aware of a PyTorch implementation of a neuralHMM, would you be interested in collaborating to write one?

Thanks,

commented

I have been able to implement Baum-Welch and PyTorch neural networks to estimate all parts of an HMM (prior probability vector, emission probabilities, and transition matrix). This means that each of the HMM components can be a vector/matrix, or a neural network which uses some kind of input data to yield a vector/matrix/parameterized distribution, very much like the neural network-HMM models implemented in the neuralHMM paper. These networks are trained by gradient descent. In my experiments, I've found that the networks successfully recover the true parameters for some simple toy problems (with both Gaussian and multinomial observed data).
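For concreteness, here is a minimal sketch of the forward pass used to get the likelihood — a toy discrete-emission HMM in plain Python, not code from either implementation:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_likelihood(pi, A, B, obs):
    """Baum-Welch forward pass in log space for a discrete-emission HMM.

    pi[i]   = P(initial state i)
    A[i][j] = P(state j | state i)
    B[i][v] = P(symbol v | state i)
    Returns log P(obs).
    """
    K = len(pi)
    # initialise with the prior and the first emission
    log_alpha = [math.log(pi[i] * B[i][obs[0]]) for i in range(K)]
    for t in range(1, len(obs)):
        log_alpha = [logsumexp([log_alpha[i] + math.log(A[i][j]) for i in range(K)])
                     + math.log(B[j][obs[t]]) for j in range(K)]
    return logsumexp(log_alpha)

# toy 2-state example
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
ll = forward_log_likelihood(pi, A, B, [0, 1, 0])  # log P(obs) = log 0.10893
```

Negating the returned value gives the NLL loss referred to below; everything else (neural parameterization, batching) layers on top of this recursion.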

However, the details of the implementation differ slightly from the methods of the Lua implementation. I'm not fluent in Lua, so it's possible that the Lua implementation and my PyTorch implementation are doing the same thing, just with different mechanisms. Specifically, my PyTorch implementation solely uses the negative log-likelihood obtained from the BW forward pass. The Baum-Welch backward pass and associated statistics are not used. To be explicit, the training process is: do the BW forward pass to get the log-likelihood as at https://github.com/ketranm/neuralHMM/blob/master/BaumWelch.lua#L198 , take the negative of the log-likelihood, call backward() on the negative log-likelihood, and then let the optimizer take a gradient step.
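That training loop can be sketched in PyTorch roughly as follows. This is a toy discrete HMM with plain tensor parameters rather than the neural components from the paper, and all names here are illustrative, not from either codebase:

```python
import torch

def hmm_nll(log_pi, log_A, log_B, obs):
    # Baum-Welch forward pass in log space; returns the negative log-likelihood
    alpha = log_pi + log_B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + log_A, dim=0) + log_B[:, obs[t]]
    return -torch.logsumexp(alpha, dim=0)

torch.manual_seed(0)
K, V = 2, 3                    # hidden states, vocabulary size
obs = [0, 2, 1, 0, 0, 2]       # toy observation sequence
# unconstrained parameters; log_softmax keeps each distribution normalised
pi_logit = torch.zeros(K, requires_grad=True)
A_logit = torch.zeros(K, K, requires_grad=True)
B_logit = torch.randn(K, V, requires_grad=True)

def loss_fn():
    return hmm_nll(torch.log_softmax(pi_logit, 0),
                   torch.log_softmax(A_logit, 1),
                   torch.log_softmax(B_logit, 1), obs)

opt = torch.optim.Adam([pi_logit, A_logit, B_logit], lr=0.05)
nll_start = loss_fn().item()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn()
    loss.backward()   # autograd plays the role of the manual BW backward pass
    opt.step()
nll_end = loss_fn().item()
```

In the actual model, `log_softmax(...)` over raw logits would be replaced by the outputs of the prior/transition/emission networks.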

I can't tell if this is different from how the Lua code works. Specifically, the Lua code takes some statistics from the Baum-Welch backward pass, and these are used by the update methods for the emission, prior, and transition networks to rescale the gradients. The purpose of this is opaque to me. The Lua code I'm referring to begins at https://github.com/ketranm/neuralHMM/blob/master/main.lua#L174

In my experiments, emulating the gradient-rescaling behavior from the Lua implementation did not help, and actually stopped the network from improving at all. (Of course, I might not have implemented it correctly.) By contrast, simply using the negative log-likelihood directly as a loss, without taking any special steps to rescale the gradients, seems to work just fine. Moreover, the network parameters converge to their true values on some simple test problems. This is not a definitive proof of correctness, but it is consistent with the desired behavior.

Is the requirement to use the statistics obtained from the backward pass of Baum-Welch in Lua just a matter of differences in how Lua and Pytorch work? Or am I missing a key fact about how the gradient updates should be carried out in neural network-HMM models?

Thanks,

Hi,
The Lua implementation predates PyTorch, so I don't think you need to translate the Lua code to PyTorch. The reason I had to write the forward and backward passes of Baum-Welch manually is that Lua/Torch does not support auto-diff. The backward pass of Baum-Welch is equivalent to calling loss.backward() in PyTorch, where the loss is computed by the forward pass of Baum-Welch via dynamic programming. The details can be found in this paper: https://aclanthology.org/W16-5901/ . Having said that, the whole computation that collects statistics from the backward Baum-Welch pass is exactly what you get implicitly in your implementation when you call backward on the negative log-likelihood.
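This equivalence can be checked numerically on a toy HMM: the expected transition counts collected from the backward statistics match A[i][j] times the gradient of the forward-pass log-likelihood with respect to A[i][j]. The sketch below is plain Python with finite differences standing in for autograd, and is not taken from either codebase:

```python
import math

def forward(pi, A, B, obs):
    # unscaled forward variables: alpha[t][i] = P(obs[:t+1], state_t = i)
    K = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(K)]]
    for t in range(1, len(obs)):
        alpha.append([B[j][obs[t]] * sum(alpha[-1][i] * A[i][j] for i in range(K))
                      for j in range(K)])
    return alpha

def backward(A, B, obs, K):
    # backward variables: beta[t][i] = P(obs[t+1:] | state_t = i)
    T = len(obs)
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(K))
                   for i in range(K)]
    return beta

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1, 0]
K, T = 2, len(obs)
alpha, beta = forward(pi, A, B, obs), backward(A, B, obs, K)
Z = sum(alpha[-1])  # P(obs)

def expected_transitions(i, j):
    # expected number of i -> j transitions, from the backward statistics
    return sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
               for t in range(T - 1)) / Z

def fd_grad(i, j, eps=1e-7):
    # finite-difference gradient of log P(obs) w.r.t. A[i][j]
    A2 = [row[:] for row in A]
    A2[i][j] += eps
    return (math.log(sum(forward(pi, A2, B, obs)[-1])) - math.log(Z)) / eps

# identity: A[i][j] * d log P / d A[i][j] == expected i -> j transition count
```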

The purpose of the scaling factor is just to speed up training in the Lua code. PyTorch presumably has an efficient way of traversing the computational graph, so I don't think you need to emulate the Lua code at all.

If you want to use HMM in pytorch, I'd recommend torch-struct: https://github.com/harvardnlp/pytorch-struct

Cheers,

commented

Thank you so much for clearing this up. It's incredibly helpful to have this confirmation. Cheers!