facebookresearch / ConvNeXt

Code release for ConvNeXt model

Large Variance in feature maps

nightsnack opened this issue · comments

Hi, I find that the variance of each stage's feature maps grows quickly as the depth increases. Although exploding variance has been a long-standing problem since ResNet's residual connections, the variance explosion in ConvNeXt seems quite serious. See the picture below. I printed the variance of x in each stage's forward function and got this.
image

Any idea why? It could soon exceed fp16's upper limit as the model gets deeper.
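
For reference, this is roughly how such per-stage variance statistics can be collected with forward hooks. The `model.stages` attribute name below follows timm's ConvNeXt implementation and is an assumption, not this repo's exact code, but the idea is the same.

```python
import torch
from timm import create_model  # assumes timm's ConvNeXt port; the repo's own model exposes stages similarly

model = create_model("convnext_tiny", pretrained=False).eval()

def report_variance(name):
    def hook(module, inputs, output):
        # variance of the stage's output feature map
        print(f"{name}: var = {output.var().item():.2f}")
    return hook

# `model.stages` is the attribute name in timm's implementation (an assumption here)
for i, stage in enumerate(model.stages):
    stage.register_forward_hook(report_variance(f"stage {i}"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
```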

I notice you use trunc_normal_ to initialize the Conv2d modules here (line 104), which is seldom seen in CNN initialization. Truncated normal is common for initializing linear layers in Transformers, but in traditional CNNs people usually use Kaiming normal to stabilize the output.

Is there any connection between this trunc_normal initialization and the exploding variance above?
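
For comparison, here is a minimal sketch of the two initialization schemes being discussed: the truncated-normal version mirrors the style of the repo's `_init_weights`, and the Kaiming version is the conventional CNN alternative. Treat it as an illustration, not the authors' exact code.

```python
import torch.nn as nn
from timm.models.layers import trunc_normal_

def init_trunc_normal(m):
    # ConvNeXt-style init: truncated normal for both Conv2d and Linear weights
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        trunc_normal_(m.weight, std=0.02)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

def init_kaiming(m):
    # conventional CNN init: Kaiming (He) normal for convolution weights
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

# usage: model.apply(init_kaiming)  # to test the alternative initialization
```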

Thanks for sharing your findings. Very interesting that the variance explodes with depth. Can you share your exact formula for calculating the variance?

We tried using PyTorch's default initialization for Conv2d layers but observed nearly no difference in training curves, so we are not sure whether the two are related.

exact formula for calculating the variance?

Here it is
image

Yes, Kaiming init has no influence on the exploding variance problem, but I did find something related. If you change the pre-norm in the downsampling layers to post-norm (like Swin V2 did), the variance drops to around 10. I'm not sure whether this post-norm would cause any accuracy degradation.
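
To make the pre-norm vs. post-norm change concrete, here is a rough sketch of the two orderings for a downsampling layer, using a channels-first LayerNorm similar to the one in this repo. It's an illustration of the idea, not a drop-in patch.

```python
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm2d(nn.LayerNorm):
    # LayerNorm over the channel dimension of (N, C, H, W) tensors,
    # analogous to the "channels_first" LayerNorm in the ConvNeXt repo
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)  # N, H, W, C
        x = F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2)  # back to N, C, H, W

def downsample_prenorm(in_dim, out_dim):
    # original ConvNeXt ordering: normalize, then strided conv
    return nn.Sequential(LayerNorm2d(in_dim, eps=1e-6),
                         nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2))

def downsample_postnorm(in_dim, out_dim):
    # Swin V2-style ordering: strided conv first, then normalize the result,
    # which keeps the variance of the downsampled features bounded
    return nn.Sequential(nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2),
                         LayerNorm2d(out_dim, eps=1e-6))
```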

Is large variance necessarily bad? Can't it be interpreted as the model learning more diverse representations, hence more variance?


I'm using ConvNeXt-Tiny as the backbone in a multiple-instance-learning project. I visualize a bag's features (1000 instances) as a greyscale image (values normalized to 0~255); the shape of the image is (1000, 768). The grey parts have values close to zero, and only a few channels take large negative or positive values (black & white). This is very different from ResNet (image below).
image
image
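
In case anyone wants to reproduce this kind of visualization, here is a rough sketch of mapping a (1000, 768) bag-feature matrix to a 0~255 greyscale image. The min/max rescaling is one reasonable choice and may differ from what was used above.

```python
import numpy as np
from PIL import Image

def features_to_greyscale(features):
    """Map a (num_instances, feature_dim) array to a uint8 greyscale image."""
    f = np.asarray(features, dtype=np.float32)
    # linearly rescale so the minimum maps to 0 and the maximum to 255
    f = (f - f.min()) / (f.max() - f.min() + 1e-8) * 255.0
    return Image.fromarray(f.astype(np.uint8), mode="L")

# e.g. features from model.forward_features(bag) with shape (1000, 768)
# features_to_greyscale(features).save("bag_features.png")
```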


Does this mean that the final classification is based only on a few black & white channels, and the grey channels don't contribute at all?

Interesting. Are you sure this is properly normalized independently across architectures? ConvNeXt might have a different absolute scale in its feature maps, and sometimes the visualization might not tell the full story.


Interesting. Are you sure this is properly normalized independently across architectures? ConvNeXt might have a different absolute scale in its feature maps, and sometimes the visualization might not tell the full story.

The output values of self.forward_features(x) are between -10 and 10. I concatenate them vertically and normalize to 0~255 to save as an image. This is what the original outputs look like:
image