Bobo-y / flexible-yolov5

More readable and flexible YOLOv5 with more backbones (GCN, ResNet, ShuffleNet, MobileNet, EfficientNet, HRNet, Swin Transformer, etc.), extra modules (CBAM, DCN, and so on), and TensorRT support

Squeeze and Excitation Conv vs Linear

seermer opened this issue

The current implementation of the SE module in this repo uses Conv2d.
However, I noticed that the equivalent Linear layer is significantly faster than Conv2d in the forward pass (which means faster inference), so maybe consider changing it? Below is the code I used for timing:

import torch 
import torch.nn as nn
import time
from torch.backends import cudnn

cudnn.benchmark = True

@torch.no_grad()
def main():
    data = torch.randn(64, 512, 1, 1).float().to("cuda")
    shape = data.shape
    m1 = nn.Sequential(  # SE excitation built from Linear layers
        nn.Linear(512, 64),
        nn.ReLU(True),
        nn.Linear(64, 512),
        nn.Sigmoid()
    ).float().to("cuda")
    m2 = nn.Sequential(  # equivalent SE excitation built from 1x1 convolutions
        nn.Conv2d(512, 64, 1),
        nn.ReLU(True),
        nn.Conv2d(64, 512, 1),
        nn.Sigmoid()
    ).float().to("cuda")
    # make sure weights and biases are the same
    m1[0].weight.data = m2[0].weight.data.squeeze()
    m1[0].bias.data = m2[0].bias.data
    m1[2].weight.data = m2[2].weight.data.squeeze()
    m1[2].bias.data = m2[2].bias.data
    res1 = data.squeeze()
    res1 = m1(res1)
    res1 = res1.reshape(shape)
    res2 = m2(data)
    print(torch.abs(res1 - res2).sum())  # check if they are equivalent

    for _ in range(2):  # warmup
        data = data.squeeze()
        data = m1(data)
        data = data.reshape(shape)
    torch.cuda.synchronize()
    t = time.perf_counter()
    for _ in range(10000):
        data = data.squeeze()
        data = m1(data)
        data = data.reshape(shape)
    torch.cuda.synchronize()
    print(time.perf_counter() - t)

    for _ in range(2):  # warmup
        data = m2(data)
    torch.cuda.synchronize()
    t = time.perf_counter()
    for _ in range(10000):
        data = m2(data)
    torch.cuda.synchronize()
    print(time.perf_counter() - t)

if __name__ == "__main__":
    main()

output:
tensor(0.0007, device='cuda:0')  # total absolute difference, just small floating-point error
2.3467748999828473  # Linear version
3.0033348000142723  # Conv2d version

Bobo-y commented

@seermer Hi, thanks. Could you train a ResNet backbone with a Linear SE and a Conv SE respectively, for just a few epochs, compare their results, and then open a PR? Let's build it together.

@Bobo-y Sorry, I don't have the resources to train the model right now, but I'm fairly sure that using a Linear layer is correct: the original SENet paper uses two fully connected layers (i.e., Linear layers), see the screenshot below.
I can still do a PR to add the LinearSEBlock module if you want.

[screenshot from the SENet paper showing the excitation branch built from two fully connected layers]
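
For reference, a minimal sketch of what such a LinearSEBlock could look like (just an illustration of the proposal, not code that exists in this repo; the reduction ratio of 16 follows the SENet paper's default):

import torch
import torch.nn as nn

class LinearSEBlock(nn.Module):
    # Squeeze-and-Excitation built from Linear layers, as in the original SENet paper
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid()
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)  # per-channel weights
        return x * w  # rescale the input channels

Because the pooled tensor is (N, C, 1, 1), flattening it to (N, C) and passing it through Linear layers computes exactly the same thing as the 1x1-conv version; only the kernel that gets dispatched differs.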

Bobo-y commented

@seermer Yes, I know the original SE uses Linear, but some implementations use Conv. I verified ResNet18 with CBAM (a conv SE module) and it can improve mAP. I did not test SE with Linear. Of course, there is no problem with replacing it with Linear.
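
For comparison, the Conv-based variant being discussed typically looks like this (a generic sketch with a hypothetical class name, not necessarily the exact module in this repo or in CBAM):

import torch
import torch.nn as nn

class ConvSEBlock(nn.Module):
    # The same squeeze-and-excitation computation, but with 1x1 convolutions
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))  # (N, C, 1, 1) weights broadcast over H and W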

@Bobo-y I don't think a Linear SE will improve mAP at all, because it is strictly equivalent to a Conv SE; any difference would just be the result of randomness in initialization and/or training.

Bobo-y commented

> @Bobo-y I don't think a Linear SE will improve mAP at all, because it is strictly equivalent to a Conv SE; any difference would just be the result of randomness in initialization and/or training.

I mean: adding a Linear SE can improve mAP compared with a backbone that has no SE. Compared with a Conv SE, I don't know which is better.

@Bobo-y How about I just do a PR that adds the Linear version as a separate class?

I'm closing this issue because this turns out to be a known cuDNN issue that has already been fixed in the nightly build with the newest cuDNN. Once the nightly becomes stable this will no longer be a problem, since Conv and Linear will use the same internal kernel.

Edit: for reference, I asked about this on the PyTorch forum: https://discuss.pytorch.org/t/squeezeexcitation-module-design-choice/157646