lmnt-com / haste

Haste: a fast, simple, and open RNN library


Layer normalization

usamec opened this issue

It would be nice to support some form of layer normalization in the LSTM and GRU layers (example: https://github.com/pytorch/pytorch/blob/master/benchmarks/fastrnns/custom_lstms.py#L171).

Hmm that's an interesting implementation. They're applying layer norm to c_t in addition to h_t. The supplementary material in Ba et al. (pp. 13–14) only applies layer norm to h_t in both of their LSTM variants.
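For reference, a condensed sketch of what that linked cell does (parameter names and initialization are simplified here; the point is the extra LayerNorm applied to c_t itself, so the normalized cell state is what gets carried to the next step):

```python
import torch
import torch.nn as nn

class LayerNormLSTMCellSketch(nn.Module):
    """Condensed sketch of the cell in the linked custom_lstms.py: layer norm is
    applied to the input and recurrent projections AND to the new cell state c_t."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.weight_ih = nn.Parameter(0.1 * torch.randn(4 * hidden_size, input_size))
        self.weight_hh = nn.Parameter(0.1 * torch.randn(4 * hidden_size, hidden_size))
        self.ln_i = nn.LayerNorm(4 * hidden_size)  # LN over the input projection
        self.ln_h = nn.LayerNorm(4 * hidden_size)  # LN over the recurrent projection
        self.ln_c = nn.LayerNorm(hidden_size)      # extra LN on c_t (not in Ba et al.)

    def forward(self, x, state):
        h, c = state
        gates = self.ln_i(x @ self.weight_ih.t()) + self.ln_h(h @ self.weight_hh.t())
        i, f, g, o = gates.chunk(4, dim=1)
        # The *normalized* cell state is carried forward to the next time step.
        c_new = self.ln_c(torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g))
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, (h_new, c_new)
```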

Do you know if there's any follow-up literature that explains the PyTorch variant?

@sharvil I don't know of any. I personally think that any variant of GRU/LSTM with LayerNorm would be a great addition.

Here's what the haste.LayerNormLSTM implementation looks like:
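
For concreteness, here's a sketch of the recurrence, following eqs. 20–22 of Ba et al. with the adjustments listed below; LN(z; γ) denotes layer normalization with gain γ and no bias term, LN(z; γ, β) the version with a bias, and the subscripts and weight names are just labels for the independent parameter sets:

$$
\begin{aligned}
\begin{pmatrix} f_t \\ i_t \\ o_t \\ g_t \end{pmatrix}
  &= \mathrm{LN}\!\left(W_h h_{t-1};\, \gamma_h\right)
   + \mathrm{LN}\!\left(W_x x_t;\, \gamma_x\right) + b \\
c_t &= \sigma(f_t) \odot c_{t-1} + \sigma(i_t) \odot \tanh(g_t) \\
h_t &= \sigma(o_t) \odot \tanh\!\left(\mathrm{LN}\!\left(c_t;\, \gamma_c, \beta_c\right)\right)
\end{aligned}
$$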



This implementation is nearly identical to eqs. 20–22 of the layer norm paper. The differences are:

  1. we don't apply a bias term to layer norms on the input or recurrent connection; these parameters are unnecessary since there's already a bias term (... + b) applied by the LSTM
  2. we use γ instead of α to denote the gain parameter (notation change)
  3. we initialize the gain to 1 and the bias to 0 instead of the other way around (seems like a typo in the paper)

I haven't gotten around to updating the docs yet, but haste.LSTM can just be replaced with haste.LayerNormLSTM. Zoneout, DropConnect, etc. are all supported in LayerNormLSTM as well.
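A minimal usage sketch, assuming the haste_pytorch bindings and that LayerNormLSTM accepts the same input_size / hidden_size / zoneout / dropout constructor arguments as haste.LSTM:

```python
import torch
import haste_pytorch as haste

# Time-major input: (seq_len, batch, input_size). Haste kernels run on CUDA.
x = torch.rand(250, 32, 128).cuda()

# Drop-in swap: haste.LSTM(...) -> haste.LayerNormLSTM(...).
# The zoneout/dropout arguments are assumed to carry over unchanged.
rnn = haste.LayerNormLSTM(
    input_size=128,
    hidden_size=256,
    zoneout=0.1,   # Zoneout regularization on the recurrent state
    dropout=0.05,  # DropConnect on the recurrent weights
).cuda()

y, state = rnn(x)  # y: output sequence, state: final recurrent state
```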

Nice! Having a GRU version would also be great, but we can probably manage with LSTMs :)

Our LSTM implementation is much further along than the GRU one, so we started with LSTMs first. When we do the GRU updates, we'll keep LayerNorm in mind. Thanks for the feature request!