facebookresearch / ConvNeXt

Code release for ConvNeXt model


there is no need to rewrite the 'class LayerNorm(nn.Module)'

REN-Yuke opened this issue · comments

The reason for rewriting 'class LayerNorm(nn.Module)' is that you assume the LayerNorm provided by PyTorch only supports the 'channels_last' format (batch_size, height, width, channels), so you wrote a new version to support the 'channels_first' format (batch_size, channels, height, width).
However, I found that F.layer_norm and nn.LayerNorm do not require a particular order of channels, height and width: F.layer_norm uses 'normalized_shape' to decide how many trailing dimensions to reduce when computing the mean and variance.

Specifically, the PyTorch implementation uses every value in an image to compute a single pair of mean and variance, and every value in the image is normalized with those two numbers. But your implementation uses the values across channels at each spatial position, so it computes one pair of mean and variance per spatial position.
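To make the difference concrete, here is a minimal sketch comparing the two statistics (shapes are illustrative, not taken from the repo):

import torch
import torch.nn.functional as F

N, C, H, W = 2, 96, 7, 7
x = torch.randn(N, C, H, W)

# PyTorch LayerNorm over (C, H, W): one mean/variance pair per image.
per_image = F.layer_norm(x, (C, H, W))           # stats have shape (N, 1, 1, 1)

# ConvNeXt channels_first LayerNorm: one mean/variance pair per spatial point.
u = x.mean(1, keepdim=True)                      # stats have shape (N, 1, H, W)
s = (x - u).pow(2).mean(1, keepdim=True)
per_position = (x - u) / torch.sqrt(s + 1e-6)

print(per_image.shape, per_position.shape)       # both (N, C, H, W), but different values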

When I changed the following lines in convnext.py, I found that it does the same thing as 'F.layer_norm' or 'nn.LayerNorm' in PyTorch.

class LayerNorm(nn.Module):
    ...
    def forward(self, x):
        ...
        # 'channels_first' branch, modified to normalize over (C, H, W):
        u = x.mean([1, 2, 3], keepdim=True)
        # u = x.mean(1, keepdim=True)  # original code
        s = (x - u).pow(2).mean([1, 2, 3], keepdim=True)
        # s = (x - u).pow(2).mean(1, keepdim=True)  # original code
        x = self.weight[None, :] * x + self.bias[None, :]
        # x = self.weight[:, None, None] * x + self.bias[:, None, None]  # original code
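For reference, here is a small self-contained check of that change (a sketch, not the repository code). Note it only matches F.layer_norm if the affine weight and bias are created with shape (C, H, W) rather than (C,), since the normalized shape is now (C, H, W):

import torch
import torch.nn.functional as F

N, C, H, W = 2, 96, 7, 7
x = torch.randn(N, C, H, W)
weight = torch.ones(C, H, W)   # assumed shape for this check
bias = torch.zeros(C, H, W)
eps = 1e-6

# The modified computation: statistics over (C, H, W) of each sample.
u = x.mean([1, 2, 3], keepdim=True)
s = (x - u).pow(2).mean([1, 2, 3], keepdim=True)
y = (x - u) / torch.sqrt(s + eps)
y = weight[None, :] * y + bias[None, :]

ref = F.layer_norm(x, (C, H, W), weight, bias, eps)
print(torch.allclose(y, ref, atol=1e-5))  # True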

There is no need to rewrite 'class LayerNorm(nn.Module)'; it's just a misunderstanding about the LayerNorm implementation.

Now I think the 'layer normalization' in your code is different from the original layer normalization.

The original layer normalization should be like this:
[image: LayerNorm as illustrated in the GroupNorm paper, with mean and variance computed over all of C, H, W for each sample]

But all the 'layer normalization' in ConvNeXt looks like this (it looks more like a kind of 'depth-wise normalization', which I think would be a more appropriate name):
[image: normalization with mean and variance computed over the channel dimension only, at each spatial position]

So did you choose this kind of 'LayerNorm' (or 'depth-wise normalization') for convenience of implementation in PyTorch?

It's because the LayerNorm in Transformers generally normalizes over only the channel dimension, without normalizing the token/spatial dimensions, so we followed them.
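In other words (a sketch of the equivalence with illustrative shapes), ConvNeXt's channels_first LayerNorm computes the same statistics as a Transformer-style nn.LayerNorm(C) applied to each spatial position as if it were a token:

import torch
import torch.nn as nn

N, C, H, W = 2, 96, 7, 7
x = torch.randn(N, C, H, W)

ln = nn.LayerNorm(C)                       # normalizes over the last (channel) dim only
tokens = x.permute(0, 2, 3, 1)             # (N, H, W, C): each position is a "token"
y = ln(tokens).permute(0, 3, 1, 2)         # back to (N, C, H, W)

# Same statistics, computed channels_first as in ConvNeXt's custom LayerNorm.
u = x.mean(1, keepdim=True)
s = (x - u).pow(2).mean(1, keepdim=True)
ref = (x - u) / torch.sqrt(s + ln.eps)
print(torch.allclose(y, ref, atol=1e-5))   # True (default weight is ones, bias is zeros)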

I don't think the LN figure illustration in the GroupNorm paper represents the "original" LN. The original LN was developed for RNNs, where each layer has only a channel dimension and no token/spatial dimensions.

Thanks for your explanation! !(^_^)!
Sorry, I'm not familiar with NLP. Your explanation makes sense. Now I know why you rewrote the LayerNorm; it seems to come down to a difference between the image data format and the text data format. I agree.

But I think the normalization approach illustrated in the GroupNorm paper may be more consistent with the name 'layer norm' in image processing. The original LayerNorm paper chose the name "Layer" because it uses all of the inputs to a layer to compute the two statistics.

[images: excerpts from the original LayerNorm paper showing the per-layer mean and variance computed over all hidden units in a layer]

I wonder whether this change would make the results a little better or not. Anyway, it doesn't affect your wonderful work! (>_o)

If I understand correctly, normalizing over all C, H, W dimensions is equivalent to GroupNorm with #groups=1. We haven't had a chance to try this, though. The PoolFormer paper uses this as its default.
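A quick sketch of that equivalence (illustrative shapes; affine parameters disabled, since GroupNorm's are per-channel while LayerNorm's are per-element):

import torch
import torch.nn as nn

N, C, H, W = 2, 96, 7, 7
x = torch.randn(N, C, H, W)

# GroupNorm with a single group normalizes each sample over all of (C, H, W),
# the same reduction as LayerNorm((C, H, W)).
gn = nn.GroupNorm(num_groups=1, num_channels=C, affine=False)
ln = nn.LayerNorm((C, H, W), elementwise_affine=False, eps=gn.eps)
print(torch.allclose(gn(x), ln(x), atol=1e-5))  # True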

FYI, the LayerNorm paper's section 6.7 talks about CNNs. Although it does not clearly say how LN is applied to (N, C, H, W), the wording does give some hints:

"With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction and re-centering and rescaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. The large number of the hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer."

My reading of it is that the "original" LayerNorm does normalize over (C, H, W) (and they think this might not be a good idea).

Although from today's Transformer point of view, H and W become the "sequence" dimension, and it then becomes natural to normalize only over the C dimension. And by the way, "positional normalization" (https://arxiv.org/pdf/1907.04312.pdf) seems to be the first work to formally name such an operation for CNNs.
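A small sketch of that operation as I understand it from the paper (illustrative shapes; the paper also keeps the extracted mean/std as features):

import torch

def pono(x, eps=1e-5):
    # x: (N, C, H, W); normalize across channels at every spatial position,
    # the same statistic the ConvNeXt channels_first LayerNorm uses.
    u = x.mean(dim=1, keepdim=True)                 # (N, 1, H, W)
    s = x.var(dim=1, keepdim=True, unbiased=False)  # (N, 1, H, W)
    return (x - u) / torch.sqrt(s + eps), u, s

y, u, s = pono(torch.randn(2, 96, 7, 7))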