sail-sg / poolformer

PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)

Home Page: https://arxiv.org/abs/2111.11418


About Normalization

chenhang98 opened this issue · comments

Hi, thanks for your excellent work.
In your ablation studies (Section 4.4), you compared Group Normalization (with the number of groups set to 1 for simplicity), Layer Normalization, and Batch Normalization. The conclusion is that Group Normalization achieves 0.7% or 0.8% higher accuracy than Layer Normalization or Batch Normalization.
But when the number of groups is 1, Group Normalization is equivalent to Layer Normalization, right?

Hi @tinyalpha ,

In transformers, the conventional data format is [B, W*H, C], and transformers usually use Layer Normalization applied only to the channel dimension. With the [B, C, W, H] data format used in this repo, that means normalizing only C, not [W, H]. Currently there is no PyTorch LayerNorm API for the [B, C, W, H] data format that normalizes only the channel dimension, so I implement it here and name it LayerNormChannel. The Layer Normalization in the paper refers to this conventional Layer Normalization of transformers.

For data format [B, C, W, H], Group Normalization normalizes dimensions C, W, and H at the same time.

In summary (run in poolformer root dir):

import torch
import torch.nn as nn
from models.poolformer import LayerNormChannel # only normalize channel
from models.poolformer import GroupNorm # group == 1

C, W, H = 64, 56, 56 # data format [B, C, W, H]

layer_norm_channel = LayerNormChannel(C) # only normalize C, conventionally used by transformers and in this paper ablation study
print(layer_norm_channel.weight.shape) # learnable affine weight shape: torch.Size([64])

group_norm = GroupNorm(C) # normalize C, H, and W.
print(group_norm.weight.shape) # learnable affine weight shape: torch.Size([64])

layer_norm = nn.LayerNorm([C, H, W]) # normalize C, H, and W.
print(layer_norm.weight.shape) # learnable affine weight shape: torch.Size([64, 56, 56])

The differences should be clear from the code above. Therefore, "when the number of groups is 1, Group Normalization is equivalent to Layer Normalization" is NOT right. There is also a small mistake in the PyTorch GroupNorm API documentation: the statement "Put all 6 channels into a single group (equivalent with LayerNorm)" is NOT right.
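As a quick numerical check (a sketch, not from the repo; it uses plain torch.nn modules with affine disabled so that only the normalized dims matter), GroupNorm with a single group and a channel-only LayerNorm produce different outputs on the same input:

import torch
import torch.nn as nn

B, C, H, W = 2, 64, 56, 56
x = torch.randn(B, C, H, W)

# GroupNorm with one group: mean/var computed over (C, H, W)
group_norm = nn.GroupNorm(num_groups=1, num_channels=C, affine=False)

# transformer-style LayerNorm: mean/var computed over C only
# (move C to the last dim, apply nn.LayerNorm(C), then move it back)
layer_norm_c = nn.LayerNorm(C, elementwise_affine=False)
channel_only = layer_norm_c(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(group_norm(x), channel_only, atol=1e-5)) # False: different dims are normalized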

Thanks for your explanation!

You are welcome~


This really helps. Thanks a lot.

So does that mean applying nn.LayerNorm on (B, C, H, W) is equal to GroupNorm with groups = 1?

And the LayerNorm that decreases the performance in the ablation study is LayerNormChannel, not nn.LayerNorm, right?

Hi @iumyx2612 , thanks for your attention.

  1. Applying nn.LayerNorm on (B, C, H, W) is NOT equal to GroupNorm with groups = 1. In the example given above, group_norm.weight.shape is torch.Size([64]) while layer_norm.weight.shape is torch.Size([64, 56, 56]). The difference is the shape of their affine weight and bias.
  2. LayerNormChannel for input (B, C, H, W) is equal to nn.LayerNorm for input (B, N, C). Because this repo uses the (B, C, H, W) data format, the LayerNorm implementation in this repo is LayerNormChannel, which corresponds to the results of the ablation study (see the sketch at the end of this reply).

In summary, to determine whether two normalizations are equal, first check which dims are used to compute the mean and variance, and then check whether their affine weight and bias have the same shape.
For (B, C, H, W) input,

| Norm | Dims used to compute mean and variance | Shape of affine weight and bias |
| --- | --- | --- |
| LayerNormChannel(num_channels=C) | (C, ) | (C, ) |
| nn.GroupNorm(num_groups=1, num_channels=C) | (C, H, W) | (C, ) |
| nn.LayerNorm(normalized_shape=(C, H, W)) | (C, H, W) | (C, H, W) |
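A quick check of this table (a sketch; at default initialization the affine weight is all ones and the bias all zeros, so the outputs depend only on which dims are normalized):

import torch
import torch.nn as nn

B, C, H, W = 2, 64, 56, 56
x = torch.randn(B, C, H, W)

gn = nn.GroupNorm(num_groups=1, num_channels=C)
ln_chw = nn.LayerNorm([C, H, W])

print(gn.weight.shape) # torch.Size([64]): affine over (C, )
print(ln_chw.weight.shape) # torch.Size([64, 56, 56]): affine over (C, H, W)

# same mean/variance dims (C, H, W), so the outputs match at default init
print(torch.allclose(gn(x), ln_chw(x), atol=1e-5)) # True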

For (B, N, C) input (like timm's ViT implementation),

| Norm | Dims used to compute mean and variance | Shape of affine weight and bias |
| --- | --- | --- |
| nn.LayerNorm(normalized_shape=C) | (C, ) | (C, ) |

You may also refer to another example where I implement a general LayerNorm for different situations with the (B, H, W, C) data format.
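To check point 2 above numerically, here is a small sketch (run in the poolformer root dir, as in the earlier snippet): LayerNormChannel applied to a (B, C, H, W) tensor matches nn.LayerNorm(C) applied to the same data reshaped to the (B, N, C) layout used by ViTs.

import torch
import torch.nn as nn
from models.poolformer import LayerNormChannel

B, C, H, W = 2, 64, 56, 56
x = torch.randn(B, C, H, W)

ln_channel = LayerNormChannel(C) # normalizes C for (B, C, H, W) input
vit_ln = nn.LayerNorm(C) # normalizes C for (B, N, C) input

out_a = ln_channel(x) # (B, C, H, W)
tokens = x.flatten(2).transpose(1, 2) # (B, N, C) with N = H * W
out_b = vit_ln(tokens).transpose(1, 2).reshape(B, C, H, W)

print(torch.allclose(out_a, out_b, atol=1e-5)) # True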


Thanks for the detailed explanation. I can see that nn.LayerNorm and nn.GroupNorm(num_groups=1) compute the mean and variance over the same dimensions; only the weights of the Norm layers differ. Since both do the same job in this situation (the dims used to compute mean and var are the same), does using nn.LayerNorm perform worse or better than nn.GroupNorm? After all, the LayerNorm in the ablation study is LayerNormChannel (which only computes mean and var over C), not nn.LayerNorm (which computes mean and var over the same dims as nn.GroupNorm).

Hi @iumyx2612 , yes, the difference between nn.LayerNorm(normalized_shape=(C, H, W)) and nn.GroupNorm(num_groups=1, num_channels=C) is the shape of their affine weight and bias, and the difference between nn.GroupNorm(num_groups=1, num_channels=C) and ViT's LayerNorm is the dims used to compute the mean and variance. For your question, I did not conduct experiments with nn.LayerNorm(normalized_shape=(C, H, W)) because it means the model can only accept one input resolution. So I don't have experimental results on hand comparing nn.LayerNorm(normalized_shape=(C, H, W)) and nn.GroupNorm(num_groups=1, num_channels=C), and I'm sorry that I cannot give a clear conclusion for it.
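A small sketch of the resolution constraint mentioned above: nn.GroupNorm accepts any H and W, while nn.LayerNorm(normalized_shape=(C, H, W)) is tied to the spatial size it was constructed with.

import torch
import torch.nn as nn

C = 64
gn = nn.GroupNorm(num_groups=1, num_channels=C)
ln = nn.LayerNorm([C, 56, 56])

x_56 = torch.randn(2, C, 56, 56)
x_28 = torch.randn(2, C, 28, 28)

print(gn(x_56).shape) # torch.Size([2, 64, 56, 56])
print(gn(x_28).shape) # torch.Size([2, 64, 28, 28])
print(ln(x_56).shape) # torch.Size([2, 64, 56, 56])
# ln(x_28) raises a RuntimeError because normalized_shape is fixed to (64, 56, 56)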


Very clear explanation, thank you for taking the time to answer me! Much appreciated.

Welcome~