pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

[Feature Request] Layer Normalization

Kaixhin opened this issue · comments

See #1601 for previous discussion on layer normalization.

I use this:

import torch
import torch.nn as nn


class LayerNorm(nn.Module):

    def __init__(self, features, eps=1e-6):
        super().__init__()
        # Learnable per-feature gain and bias
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # Normalize over the last (feature) dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

Nice @jekbradbury - though I presume 2D versions etc. would have to be adapted to work on a per-channel basis, like batch norm? Either way, it's still probably worth having this in the library with a test.

Yeah, as usual I'm more or less forgetting that images exist. This should at least generalize to a couple different situations in NLP though, or really anywhere that channels are the last dimension and you want separate moments for every {timestep, y-position, etc.}.
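For example, here is a minimal usage sketch (not from the original comments) of the LayerNorm module above on NLP-shaped data, where channels are the last dimension and moments are computed separately for every timestep:

import torch

ln = LayerNorm(features=256)       # the module defined above
x = torch.randn(32, 100, 256)      # (batch, time, features)
y = ln(x)                          # mean/std over the last dim, one pair per {sample, timestep}
print(y.shape)                     # torch.Size([32, 100, 256])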

This is a layer normalization implementation in TensorFlow. How can I port it to PyTorch? In particular, how do I translate tf.nn.moments and the other ops?

import numpy as np
import tensorflow as tf


def Layernorm(name, norm_axes, inputs):
    mean, var = tf.nn.moments(inputs, norm_axes, keep_dims=True)

    # Assume the 'neurons' axis is the first of norm_axes.
    # This is the case for fully-connected and BCHW conv layers.
    n_neurons = inputs.get_shape().as_list()[norm_axes[0]]

    # lib.param is the poster's own parameter-creation helper.
    offset = lib.param(name + '.offset', np.zeros(n_neurons, dtype='float32'))
    scale = lib.param(name + '.scale', np.ones(n_neurons, dtype='float32'))

    # Add broadcasting dims to offset and scale (e.g. BCHW conv data)
    offset = tf.reshape(offset, [-1] + [1 for i in range(len(norm_axes) - 1)])
    scale = tf.reshape(scale, [-1] + [1 for i in range(len(norm_axes) - 1)])

    result = tf.nn.batch_normalization(inputs, mean, var, offset, scale, 1e-5)

    return result
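For reference, here is a rough PyTorch sketch of the same function (my own translation, not code from this thread); the module name and its exact interface are illustrative only:

import torch
import torch.nn as nn


class TFStyleLayerNorm(nn.Module):
    """Rough port of the TF snippet above: normalize over norm_axes, with a
    per-'neuron' offset and scale, where the 'neuron' axis is the first of
    norm_axes (e.g. the channel axis for BCHW conv data)."""

    def __init__(self, n_neurons, norm_axes, eps=1e-5):
        super().__init__()
        self.norm_axes = tuple(norm_axes)
        self.eps = eps
        self.offset = nn.Parameter(torch.zeros(n_neurons))
        self.scale = nn.Parameter(torch.ones(n_neurons))

    def forward(self, inputs):
        # tf.nn.moments(..., keep_dims=True) returns the biased variance, hence unbiased=False
        mean = inputs.mean(self.norm_axes, keepdim=True)
        var = inputs.var(self.norm_axes, unbiased=False, keepdim=True)

        # Add broadcasting dims to offset and scale (e.g. (C,) -> (C, 1, 1) for BCHW)
        shape = [-1] + [1] * (len(self.norm_axes) - 1)
        offset = self.offset.view(*shape)
        scale = self.scale.view(*shape)

        # Same formula as tf.nn.batch_normalization(x, mean, var, offset, scale, eps)
        return scale * (inputs - mean) / torch.sqrt(var + self.eps) + offset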

I recently created a naive implementation of LayerNorm for a GRU. It makes things quite a bit slower, mainly because everything is done manually, but hopefully it can be helpful.

Also note that I don't divide the hidden state by its std, because I usually initialize the hidden state with zeros.

class LayerNormGRUCell(nn.GRUCell):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LayerNormGRUCell, self).__init__(input_size, hidden_size, bias)

        self.gamma_ih = nn.Parameter(torch.ones(3 * self.hidden_size))
        self.gamma_hh = nn.Parameter(torch.ones(3 * self.hidden_size))
        self.eps = 0

    def _layer_norm_x(self, x, g, b):
        # keepdim=True so mean/std keep a size-1 dim and can expand_as(x)
        mean = x.mean(1, keepdim=True).expand_as(x)
        std = x.std(1, keepdim=True).expand_as(x)
        return g.expand_as(x) * ((x - mean) / (std + self.eps)) + b.expand_as(x)

    def _layer_norm_h(self, x, g, b):
        # The hidden-state branch is not divided by std (see the note above)
        mean = x.mean(1, keepdim=True).expand_as(x)
        return g.expand_as(x) * (x - mean) + b.expand_as(x)

    def forward(self, x, h):

        ih_rz = self._layer_norm_x(
            torch.mm(x, self.weight_ih.narrow(0, 0, 2 * self.hidden_size).transpose(0, 1)),
            self.gamma_ih.narrow(0, 0, 2 * self.hidden_size),
            self.bias_ih.narrow(0, 0, 2 * self.hidden_size))

        hh_rz = self._layer_norm_h(
            torch.mm(h, self.weight_hh.narrow(0, 0, 2 * self.hidden_size).transpose(0, 1)),
            self.gamma_hh.narrow(0, 0, 2 * self.hidden_size),
            self.bias_hh.narrow(0, 0, 2 * self.hidden_size))

        rz = torch.sigmoid(ih_rz + hh_rz)
        r = rz.narrow(1, 0, self.hidden_size)
        z = rz.narrow(1, self.hidden_size, self.hidden_size)

        ih_n = self._layer_norm_x(
            torch.mm(x, self.weight_ih.narrow(0, 2 * self.hidden_size, self.hidden_size).transpose(0, 1)),
            self.gamma_ih.narrow(0, 2 * self.hidden_size, self.hidden_size),
            self.bias_ih.narrow(0, 2 * self.hidden_size, self.hidden_size))

        hh_n = self._layer_norm_h(
            torch.mm(h, self.weight_hh.narrow(0, 2 * self.hidden_size, self.hidden_size).transpose(0, 1)),
            self.gamma_hh.narrow(0, 2 * self.hidden_size, self.hidden_size),
            self.bias_hh.narrow(0, 2 * self.hidden_size, self.hidden_size))

        n = torch.tanh(ih_n + r * hh_n)
        h = (1 - z) * n + z * h
        return h

class LayerNormGRU(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LayerNormGRU, self).__init__()
        self.cell = LayerNormGRUCell(input_size, hidden_size, bias)
        self.weight_ih_l0 = self.cell.weight_ih
        self.weight_hh_l0 = self.cell.weight_hh
        self.bias_ih_l0 = self.cell.bias_ih
        self.bias_hh_l0 = self.cell.bias_hh

    def forward(self, xs, h):
        h = h.squeeze(0)
        ys = []
        for i in range(xs.size(0)):
            x = xs.narrow(0, i, 1).squeeze(0)
            h = self.cell(x, h)
            ys.append(h.unsqueeze(0))
        y = torch.cat(ys, 0)
        h = h.unsqueeze(0)
        return y, h
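A minimal usage sketch (not part of the original comment), assuming the LayerNormGRU / LayerNormGRUCell classes above; it mimics a single-layer, unidirectional nn.GRU with (seq_len, batch, input_size) inputs:

import torch

input_size, hidden_size = 64, 128
rnn = LayerNormGRU(input_size, hidden_size)

xs = torch.randn(10, 32, input_size)   # (seq_len, batch, input_size)
h0 = torch.zeros(1, 32, hidden_size)   # (num_layers=1, batch, hidden_size)

ys, hn = rnn(xs, h0)
print(ys.shape, hn.shape)              # torch.Size([10, 32, 128]) torch.Size([1, 32, 128])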

@jekbradbury Hi Jek, is there a more efficient implementation of LayerNorm? I use your code and it is very cool, but since there are more than 50 LN layers in my network, I feel it has become the speed bottleneck.

@jekbradbury, this is a LayerNorm for >=2D inputs, modified from your code:

import torch
import torch.nn as nn


class LayerNorm(nn.Module):

    def __init__(self, num_features, eps=1e-5, affine=True):
        super(LayerNorm, self).__init__()
        self.num_features = num_features
        self.affine = affine
        self.eps = eps

        if self.affine:
            # Per-channel gain and bias
            self.gamma = nn.Parameter(torch.Tensor(num_features).uniform_())
            self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        # Mean/std are computed per sample, over all non-batch dimensions
        shape = [-1] + [1] * (x.dim() - 1)
        mean = x.view(x.size(0), -1).mean(1).view(*shape)
        std = x.view(x.size(0), -1).std(1).view(*shape)

        y = (x - mean) / (std + self.eps)
        if self.affine:
            # gamma/beta broadcast over the channel (second) dimension
            shape = [1, -1] + [1] * (x.dim() - 2)
            y = self.gamma.view(*shape) * y + self.beta.view(*shape)
        return y

@D-X-Y @jekbradbury I can also confirm that a simple implementation of LayerNorm seems to be quite slow in PyTorch (it slows down my training by at least 3x). Maybe the mean and standard deviation computation could be combined into a single CUDA kernel?
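Not an answer to the kernel question, but as a partial mitigation: newer PyTorch releases (an assumption relative to this thread's era) provide torch.var_mean, which returns the variance and mean in one call and saves a pass over the data. A rough functional sketch:

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # One mean/var per sample, computed over all non-batch dimensions in a single call
    dims = tuple(range(1, x.dim()))
    var, mean = torch.var_mean(x, dim=dims, unbiased=False, keepdim=True)
    y = (x - mean) / torch.sqrt(var + eps)
    # Per-channel affine parameters broadcast over the remaining dims
    shape = [1, -1] + [1] * (x.dim() - 2)
    return gamma.view(*shape) * y + beta.view(*shape)

x = torch.randn(8, 16, 32, 32)
gamma, beta = torch.ones(16), torch.zeros(16)
print(layer_norm(x, gamma, beta).shape)   # torch.Size([8, 16, 32, 32])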

Just a quick note: @LynnHo 's version calculates the mean and std over all channels combined, but the beta and gamma are per-channel. (This is probably intended, but I was confused for a moment.)

@t-vi I just followed what tf-slim does. However, this implementation is unbearably slow. Does anybody know of a more efficient implementation?

There seems to be one for TF, maybe you can use that as inspiration: https://github.com/MycChiu/fast-LayerNorm-TF .

@LynnHo I am new to Pytorch and just going through the layer-normalization code. The mean and std will return a single value even for images in CNN? As x.size(0) represents batch size which is one in this case. Just bit confused

@Blade6570 According to the Layer Normalization paper, yes, the mean and standard deviation should be a single number shared between all activations (see section 3 of the paper). This is true for all network types.

EDIT: Although the mean and standard deviation are each a single number over all activations, the trainable gain and bias terms should have the same size as the vector being normalized, ignoring the batch size.

The paper also notes that Batch Normalization works better than Layer Normalization for Convolutional Neural Networks (see section 6.7).

Note that new research that I am unaware of may have been published since this paper was released (July 21, 2016).
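A small sketch (not from the thread) illustrating this for an image-shaped input: one mean/std per sample, and gain/bias with the same size as one sample's activations (here C x H x W), ignoring the batch dimension:

import torch

N, C, H, W = 4, 3, 8, 8
x = torch.randn(N, C, H, W)

mean = x.view(N, -1).mean(1)     # one number per sample -> shape (N,)
std = x.view(N, -1).std(1)       # one number per sample -> shape (N,)

gain = torch.ones(C, H, W)       # trainable gain, same size as one sample
bias = torch.zeros(C, H, W)      # trainable bias, same size as one sample

y = gain * (x - mean.view(N, 1, 1, 1)) / (std.view(N, 1, 1, 1) + 1e-5) + bias
print(mean.shape, y.shape)       # torch.Size([4]) torch.Size([4, 3, 8, 8])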

Closed by #4922 - thanks @ssnl!

Are the implementations of Layer Normalization and Instance Normalization identical?

@wandering007 They are very similar but have some subtle differences. IN is applied to each channel of channeled data like images, whereas LN is usually applied over the entire sample, often in NLP tasks. Also, LN applies an elementwise affine transform, while IN usually doesn't apply an affine transform. PyTorch does support IN with a scalar affine transform applied to each channel, but that is mostly a consequence of IN being implemented with the BN code.
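A quick sketch of the difference on an image-shaped tensor (my own example, not from the comment): nn.LayerNorm takes statistics over (C, H, W) per sample with an elementwise affine transform, while nn.InstanceNorm2d takes statistics over (H, W) per sample and per channel, with an optional per-channel scalar affine transform:

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)              # (N, C, H, W)

ln = nn.LayerNorm([16, 32, 32])             # stats over C*H*W for each sample
inorm = nn.InstanceNorm2d(16, affine=True)  # stats over H*W for each sample and channel

print(ln(x).shape, inorm(x).shape)          # both torch.Size([8, 16, 32, 32])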

@ssnl Thanks for your patient explanation. So if I want to use LN in the context of channeled data like images, should I use IN (with affine=True) instead, e.g. when the ops are convolutions instead of linear layers?

You can use both on images; they just work differently. Depending on the design and purpose of your work, I can see either one working (with affine on or off).

@ssnl your explanation here was super helpful to me. I would suggest adding a bit more documentation to the IN layer description to clarify this.