pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

[Feature Request] Layer Normalization

Kaixhin opened this issue · comments

See #1601 for previous discussion on layer normalization.

I use this:

import torch
import torch.nn as nn


class LayerNorm(nn.Module):

    def __init__(self, features, eps=1e-6):
        super().__init__()
        # Learnable per-feature gain and bias
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # Normalize over the last (feature) dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

Nice @jekbradbury - though I presume 2D versions etc. would have to be adapted to work on a per-channel basis, like batch norm? Either way, it's still probably worth having this in the library with a test.

Yeah, as usual I'm more or less forgetting that images exist. This should at least generalize to a couple different situations in NLP though, or really anywhere that channels are the last dimension and you want separate moments for every {timestep, y-position, etc.}.
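For example, here is a minimal usage sketch (not from the original comments) of the LayerNorm module above on NLP-shaped data, where channels are the last dimension and moments are computed separately for every timestep:

import torch

ln = LayerNorm(features=256)       # the module defined above
x = torch.randn(32, 100, 256)      # (batch, time, features)
y = ln(x)                          # mean/std over the last dim, one pair per {sample, timestep}
print(y.shape)                     # torch.Size([32, 100, 256])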

This is a layer normalization implementation in TensorFlow. How can I port it to PyTorch? In particular, how do I translate tf.nn.moments and the other ops?

import numpy as np
import tensorflow as tf


def Layernorm(name, norm_axes, inputs):
    mean, var = tf.nn.moments(inputs, norm_axes, keep_dims=True)

    # Assume the 'neurons' axis is the first of norm_axes.
    # This is the case for fully-connected and BCHW conv layers.
    n_neurons = inputs.get_shape().as_list()[norm_axes[0]]

    # lib.param is the poster's own parameter-creation helper.
    offset = lib.param(name + '.offset', np.zeros(n_neurons, dtype='float32'))
    scale = lib.param(name + '.scale', np.ones(n_neurons, dtype='float32'))

    # Add broadcasting dims to offset and scale (e.g. BCHW conv data)
    offset = tf.reshape(offset, [-1] + [1 for i in range(len(norm_axes) - 1)])
    scale = tf.reshape(scale, [-1] + [1 for i in range(len(norm_axes) - 1)])

    result = tf.nn.batch_normalization(inputs, mean, var, offset, scale, 1e-5)

    return result
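For reference, here is a rough PyTorch sketch of the same function (my own translation, not code from this thread); the module name and its exact interface are illustrative only:

import torch
import torch.nn as nn


class TFStyleLayerNorm(nn.Module):
    """Rough port of the TF snippet above: normalize over norm_axes, with a
    per-'neuron' offset and scale, where the 'neuron' axis is the first of
    norm_axes (e.g. the channel axis for BCHW conv data)."""

    def __init__(self, n_neurons, norm_axes, eps=1e-5):
        super().__init__()
        self.norm_axes = tuple(norm_axes)
        self.eps = eps
        self.offset = nn.Parameter(torch.zeros(n_neurons))
        self.scale = nn.Parameter(torch.ones(n_neurons))

    def forward(self, inputs):
        # tf.nn.moments(..., keep_dims=True) returns the biased variance, hence unbiased=False
        mean = inputs.mean(self.norm_axes, keepdim=True)
        var = inputs.var(self.norm_axes, unbiased=False, keepdim=True)

        # Add broadcasting dims to offset and scale (e.g. (C,) -> (C, 1, 1) for BCHW)
        shape = [-1] + [1] * (len(self.norm_axes) - 1)
        offset = self.offset.view(*shape)
        scale = self.scale.view(*shape)

        # Same formula as tf.nn.batch_normalization(x, mean, var, offset, scale, eps)
        return scale * (inputs - mean) / torch.sqrt(var + self.eps) + offset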

I recently created a naive implementation of LayerNorm for a GRU. It makes things quite a bit slower, mainly because everything is done manually, but hopefully it can be helpful.

Also note that I don't divide the hidden state by its std, because I usually initialize the hidden state with zeros.

class LayerNormGRUCell(nn.GRUCell):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LayerNormGRUCell, self).__init__(input_size, hidden_size, bias)

        self.gamma_ih = nn.Parameter(torch.ones(3 * self.hidden_size))
        self.gamma_hh = nn.Parameter(torch.ones(3 * self.hidden_size))
        self.eps = 0

    def _layer_norm_x(self, x, g, b):
        # keepdim=True so mean/std keep a size-1 dim and can expand_as(x)
        mean = x.mean(1, keepdim=True).expand_as(x)
        std = x.std(1, keepdim=True).expand_as(x)
        return g.expand_as(x) * ((x - mean) / (std + self.eps)) + b.expand_as(x)

    def _layer_norm_h(self, x, g, b):
        # The hidden-state branch is not divided by std (see the note above)
        mean = x.mean(1, keepdim=True).expand_as(x)
        return g.expand_as(x) * (x - mean) + b.expand_as(x)

    def forward(self, x, h):

        ih_rz = self._layer_norm_x(
            torch.mm(x, self.weight_ih.narrow(0, 0, 2 * self.hidden_size).transpose(0, 1)),
            self.gamma_ih.narrow(0, 0, 2 * self.hidden_size),
            self.bias_ih.narrow(0, 0, 2 * self.hidden_size))

        hh_rz = self._layer_norm_h(
            torch.mm(h, self.weight_hh.narrow(0, 0, 2 * self.hidden_size).transpose(0, 1)),
            self.gamma_hh.narrow(0, 0, 2 * self.hidden_size),
            self.bias_hh.narrow(0, 0, 2 * self.hidden_size))

        rz = torch.sigmoid(ih_rz + hh_rz)
        r = rz.narrow(1, 0, self.hidden_size)
        z = rz.narrow(1, self.hidden_size, self.hidden_size)

        ih_n = self._layer_norm_x(
            torch.mm(x, self.weight_ih.narrow(0, 2 * self.hidden_size, self.hidden_size).transpose(0, 1)),
            self.gamma_ih.narrow(0, 2 * self.hidden_size, self.hidden_size),
            self.bias_ih.narrow(0, 2 * self.hidden_size, self.hidden_size))

        hh_n = self._layer_norm_h(
            torch.mm(h, self.weight_hh.narrow(0, 2 * self.hidden_size, self.hidden_size).transpose(0, 1)),
            self.gamma_hh.narrow(0, 2 * self.hidden_size, self.hidden_size),
            self.bias_hh.narrow(0, 2 * self.hidden_size, self.hidden_size))

        n = torch.tanh(ih_n + r * hh_n)
        h = (1 - z) * n + z * h
        return h

class LayerNormGRU(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LayerNormGRU, self).__init__()
        self.cell = LayerNormGRUCell(input_size, hidden_size, bias)
        self.weight_ih_l0 = self.cell.weight_ih
        self.weight_hh_l0 = self.cell.weight_hh
        self.bias_ih_l0 = self.cell.bias_ih
        self.bias_hh_l0 = self.cell.bias_hh

    def forward(self, xs, h):
        h = h.squeeze(0)
        ys = []
        for i in range(xs.size(0)):
            x = xs.narrow(0, i, 1).squeeze(0)
            h = self.cell(x, h)
            ys.append(h.unsqueeze(0))
        y = torch.cat(ys, 0)
        h = h.unsqueeze(0)
        return y, h
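A minimal usage sketch (not part of the original comment), assuming the LayerNormGRU / LayerNormGRUCell classes above; it mimics a single-layer, unidirectional nn.GRU with (seq_len, batch, input_size) inputs:

import torch

input_size, hidden_size = 64, 128
rnn = LayerNormGRU(input_size, hidden_size)

xs = torch.randn(10, 32, input_size)   # (seq_len, batch, input_size)
h0 = torch.zeros(1, 32, hidden_size)   # (num_layers=1, batch, hidden_size)

ys, hn = rnn(xs, h0)
print(ys.shape, hn.shape)              # torch.Size([10, 32, 128]) torch.Size([1, 32, 128])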

@jekbradbury Hi Jek, is there a more efficient implementation of LayerNorm? I use your code and it is very cool, but since there are more than 50 LN layers in my network, I feel it has become the speed bottleneck.

@jekbradbury, this is a LayerNorm for >=2D inputs, modified from your code:

import torch
import torch.nn as nn


class LayerNorm(nn.Module):

    def __init__(self, num_features, eps=1e-5, affine=True):
        super(LayerNorm, self).__init__()
        self.num_features = num_features
        self.affine = affine
        self.eps = eps

        if self.affine:
            # Per-channel gain and bias
            self.gamma = nn.Parameter(torch.Tensor(num_features).uniform_())
            self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        # Mean/std are computed per sample, over all non-batch dimensions
        shape = [-1] + [1] * (x.dim() - 1)
        mean = x.view(x.size(0), -1).mean(1).view(*shape)
        std = x.view(x.size(0), -1).std(1).view(*shape)

        y = (x - mean) / (std + self.eps)
        if self.affine:
            # gamma/beta broadcast over the channel (second) dimension
            shape = [1, -1] + [1] * (x.dim() - 2)
            y = self.gamma.view(*shape) * y + self.beta.view(*shape)
        return y

@D-X-Y @jekbradbury I can also confirm that a simple implementation of LayerNorm seems to be quite slow in PyTorch (it slows down my training by at least 3x). Maybe the mean and standard deviation computation could be combined into a single CUDA kernel?
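Not an answer to the kernel question, but as a partial mitigation: newer PyTorch releases (an assumption relative to this thread's era) provide torch.var_mean, which returns the variance and mean in one call and saves a pass over the data. A rough functional sketch:

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # One mean/var per sample, computed over all non-batch dimensions in a single call
    dims = tuple(range(1, x.dim()))
    var, mean = torch.var_mean(x, dim=dims, unbiased=False, keepdim=True)
    y = (x - mean) / torch.sqrt(var + eps)
    # Per-channel affine parameters broadcast over the remaining dims
    shape = [1, -1] + [1] * (x.dim() - 2)
    return gamma.view(*shape) * y + beta.view(*shape)

x = torch.randn(8, 16, 32, 32)
gamma, beta = torch.ones(16), torch.zeros(16)
print(layer_norm(x, gamma, beta).shape)   # torch.Size([8, 16, 32, 32])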

Just a quick note: @LynnHo 's version calculates the mean and std over all channels combined, but the beta and gamma are per-channel. (This is probably intended, but I was confused for a moment.)

@t-vi I just followed what tf-slim does. However, this implementation is unbearably slow. Does anybody know of a more efficient implementation?

There seems to be one for TF, maybe you can use that as inspiration: https://github.com/MycChiu/fast-LayerNorm-TF .

@LynnHo I am new to Pytorch and just going through the layer-normalization code. The mean and std will return a single value even for images in CNN? As x.size(0) represents batch size which is one in this case. Just bit confused

@Blade6570 According to the Layer Normalization paper, yes, the mean and standard deviation should be a single number shared between all activations (see section 3 of the paper). This is true for all network types.

EDIT: Although the mean and standard deviation are each a single number over all activations, the trainable gain and bias terms should have the same size as the vector being normalized, ignoring the batch size.

The paper also notes that Batch Normalization works better than Layer Normalization for Convolutional Neural Networks (see section 6.7).

Note that new research that I am unaware of may have been published since this paper was released (July 21, 2016).
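A small sketch (not from the thread) illustrating this for an image-shaped input: one mean/std per sample, and gain/bias with the same size as one sample's activations (here C x H x W), ignoring the batch dimension:

import torch

N, C, H, W = 4, 3, 8, 8
x = torch.randn(N, C, H, W)

mean = x.view(N, -1).mean(1)     # one number per sample -> shape (N,)
std = x.view(N, -1).std(1)       # one number per sample -> shape (N,)

gain = torch.ones(C, H, W)       # trainable gain, same size as one sample
bias = torch.zeros(C, H, W)      # trainable bias, same size as one sample

y = gain * (x - mean.view(N, 1, 1, 1)) / (std.view(N, 1, 1, 1) + 1e-5) + bias
print(mean.shape, y.shape)       # torch.Size([4]) torch.Size([4, 3, 8, 8])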

Closed by #4922 - thanks @ssnl!

Are the implementations of Layer Normalization and Instance Normalization identical?

@wandering007 They are very similar but have some subtle differences. IN is applied to each channel of channeled data like images, whereas LN is usually applied over the entire sample, often in NLP tasks. Also, LN applies an elementwise affine transform, while IN usually doesn't apply an affine transform. PyTorch does support IN with a scalar affine transform applied to each channel, but that is mostly a consequence of IN being implemented with the BN code.
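A quick sketch of the difference on an image-shaped tensor (my own example, not from the comment): nn.LayerNorm takes statistics over (C, H, W) per sample with an elementwise affine transform, while nn.InstanceNorm2d takes statistics over (H, W) per sample and per channel, with an optional per-channel scalar affine transform:

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)              # (N, C, H, W)

ln = nn.LayerNorm([16, 32, 32])             # stats over C*H*W for each sample
inorm = nn.InstanceNorm2d(16, affine=True)  # stats over H*W for each sample and channel

print(ln(x).shape, inorm(x).shape)          # both torch.Size([8, 16, 32, 32])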

@ssnl Thanks for your patient explanation. So if I want to use LN in the context of channeled data like images, should I use IN (with affine=True) instead, e.g. when the ops are convolutions instead of linear layers?

You can use both on images; they just work differently. Depending on the design and purpose of your work, I can see either one working (with affine on or off).

@ssnl your explanation here was super helpful to me. I would suggest adding a bit more documentation to the IN layer description to clarify this.