Squeeze and Excitation Conv vs Linear

Question

seermer opened this issue 2 years ago

The current implementation of the SE module in this repo uses Conv2d.
However, I noticed that the equivalent Linear layers are measurably faster than Conv2d in the forward pass (about 2.3 s versus 3.0 s for 10,000 forward calls in my run, which means faster inference). Maybe consider changing it? Below is the code I used for timing:

import torch 
import torch.nn as nn
import time
from torch.backends import cudnn

cudnn.benchmark = True

@torch.no_grad()
def main():
    data = torch.randn(64, 512,1,1).float().to("cuda")
    shape = data.shape
    m1 = nn.Sequential(
        nn.Linear(512, 64),
        nn.ReLU(True),
        nn.Linear(64, 512),
        nn.Sigmoid()
    ).float().to("cuda")
    m2 = nn.Sequential(
        nn.Conv2d(512, 64, 1),
        nn.ReLU(True),
        nn.Conv2d(64, 512, 1),
        nn.Sigmoid()
    ).float().to("cuda")
    # make sure weights and bias are the same
    m1[0].weight.data = m2[0].weight.data.squeeze()
    m1[0].bias.data = m2[0].bias.data
    m1[2].weight.data = m2[2].weight.data.squeeze()
    m1[2].bias.data = m2[2].bias.data
    res1 = data.squeeze()
    res1 = m1(res1)
    res1 = res1.reshape(shape)
    res2 = m2(data)
    print(torch.abs(res1 - res2).sum())  # check if they are equivalent

    for _ in range(2):  # warmup
        data = data.squeeze()
        data = m1(data)
        data = data.reshape(shape)
    torch.cuda.synchronize()
    t = time.perf_counter()
    for _ in range(10000):
        data = data.squeeze()
        data = m1(data)
        data = data.reshape(shape)
    torch.cuda.synchronize()
    print(time.perf_counter() - t)

    for _ in range(2):  # warmup
        data = m2(data)
    torch.cuda.synchronize()
    t = time.perf_counter()
    for _ in range(10000):
        data = m2(data)
    torch.cuda.synchronize()
    print(time.perf_counter() - t)

if __name__ == "__main__":
    main()

Output:
tensor(0.0007, device='cuda:0')  # total absolute difference; a small floating-point discrepancy
2.3467748999828473  # Linear: seconds for 10,000 forward calls
3.0033348000142723  # Conv2d: seconds for 10,000 forward calls
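
As a cross-check on the manual timing loop above, here is a minimal sketch that uses `torch.utils.benchmark.Timer`, which handles warm-up and CUDA synchronization itself. The helper names `build_se_mlps` and `time_se_mlps` are hypothetical, the sizes default to the 512 -> 64 -> 512 shapes from the script above, and it falls back to CPU when no GPU is present, so absolute numbers will differ from the ones reported here.

```python
import torch
import torch.nn as nn
from torch.utils import benchmark


def build_se_mlps(channels=512, reduced=64, device="cpu"):
    """Return (linear_mlp, conv_mlp) that share identical weights."""
    linear_mlp = nn.Sequential(
        nn.Linear(channels, reduced), nn.ReLU(),
        nn.Linear(reduced, channels), nn.Sigmoid(),
    ).to(device).eval()
    conv_mlp = nn.Sequential(
        nn.Conv2d(channels, reduced, 1), nn.ReLU(),
        nn.Conv2d(reduced, channels, 1), nn.Sigmoid(),
    ).to(device).eval()
    with torch.no_grad():
        for i in (0, 2):
            # A 1x1 conv weight of shape (out, in, 1, 1) flattens to (out, in).
            linear_mlp[i].weight.copy_(conv_mlp[i].weight.flatten(1))
            linear_mlp[i].bias.copy_(conv_mlp[i].bias)
    return linear_mlp, conv_mlp


def time_se_mlps(batch=64, channels=512, reduced=64, min_run_time=0.2):
    """Return the median seconds per forward call for both variants."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    linear_mlp, conv_mlp = build_se_mlps(channels, reduced, device)
    x = torch.randn(batch, channels, 1, 1, device=device)
    cases = [
        ("linear", linear_mlp, "m(x.flatten(1)).view_as(x)"),
        ("conv", conv_mlp, "m(x)"),
    ]
    results = {}
    with torch.no_grad():
        for name, module, stmt in cases:
            timer = benchmark.Timer(stmt=stmt, globals={"m": module, "x": x})
            measurement = timer.blocked_autorange(min_run_time=min_run_time)
            results[name] = measurement.median
    return results


if __name__ == "__main__":
    for name, seconds in time_se_mlps().items():
        print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Unlike the loop above, this feeds the same input tensor on every call instead of chaining each output into the next input, so only the forward cost of the two variants is compared.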

Bobo~ · Answer 1 · Thu Jul 28 2022 09:31:43 GMT+0800 (China Standard Time)

@seermer Hi, thanks. Could you train a ResNet backbone with Linear SE and with Conv SE for a few epochs each, compare the results, and then open a PR? Let's build this together.

Zhantao Yang · Answer 2 · Thu Jul 28 2022 10:52:10 GMT+0800 (China Standard Time)

@Bobo-y Sorry, I don't have the resources to train the model right now, but I am pretty sure that using Linear layers is correct: the original SENet paper uses two fully connected layers (i.e., Linear layers).
I can still open a PR to add a LinearSEBlock module if you want.

Bobo~ · Answer 3 · Thu Jul 28 2022 11:20:57 GMT+0800 (China Standard Time)

@seermer Yes, I know the original SE uses Linear, but some implementations use Conv. I verified that ResNet18 with CBAM (Conv SE module) improves mAP. I have not tested SE with Linear. Of course, there is no problem replacing it with Linear.

Zhantao Yang · Answer 4 · Thu Jul 28 2022 11:24:31 GMT+0800 (China Standard Time)

@Bobo-y I don't think Linear SE will improve mAP at all, because it is strictly equivalent to Conv SE. Any difference would just be the result of randomness and/or initialization.
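
To illustrate the equivalence claim beyond the forward pass, here is a small CPU-only sketch. It assumes a single 1x1 Conv2d and a Linear layer that start from the same weights, and checks that both the outputs and the weight gradients agree up to floating-point tolerance, which is what would make training behave the same given identical initialization. The sizes and the squared-sum loss are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
channels, reduced = 32, 8  # arbitrary small sizes for the demonstration

conv = nn.Conv2d(channels, reduced, kernel_size=1)
linear = nn.Linear(channels, reduced)
with torch.no_grad():
    # Give the Linear layer the same parameters as the 1x1 convolution.
    linear.weight.copy_(conv.weight.flatten(1))
    linear.bias.copy_(conv.bias)

x = torch.randn(4, channels, 1, 1)  # pooled SE input: N x C x 1 x 1
out_conv = conv(x).flatten(1)       # N x reduced
out_linear = linear(x.flatten(1))   # N x reduced

# Backpropagate the same scalar loss through both layers.
(out_conv ** 2).sum().backward()
(out_linear ** 2).sum().backward()

outputs_match = torch.allclose(out_conv, out_linear, atol=1e-5)
grads_match = torch.allclose(
    conv.weight.grad.flatten(1), linear.weight.grad, atol=1e-4
)
print(outputs_match, grads_match)
```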

Bobo~ · Answer 5 · Thu Jul 28 2022 11:27:04 GMT+0800 (China Standard Time)

> @Bobo-y I don't think Linear SE will improve mAP at all, because it is strictly equivalent to Conv SE. Any difference would just be the result of randomness and/or initialization.

I mean that adding a Linear SE can improve mAP compared with a backbone that has no SE. Compared with Conv SE, I don't know which is better.

Zhantao Yang · Answer 6 · Thu Jul 28 2022 11:31:02 GMT+0800 (China Standard Time)

@Bobo-y How about I just open a PR that adds the Linear version as a separate class?
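
For illustration, a separate Linear-based class could look roughly like the sketch below. This is not the code from this repo or from any PR: the class name `LinearSEBlock` is taken from the discussion above, while the constructor arguments and the default `reduction=8` (matching the 512 -> 64 sizes in the benchmark) are assumptions.

```python
import torch
import torch.nn as nn


class LinearSEBlock(nn.Module):
    """Hypothetical Linear-based SE block; names and defaults are illustrative."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: N x C x H x W -> N x C x 1 x 1
        self.fc = nn.Sequential(             # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c = x.shape[:2]
        scale = self.fc(self.pool(x).flatten(1))  # N x C gate values in (0, 1)
        return x * scale.view(n, c, 1, 1)         # rescale each channel


if __name__ == "__main__":
    block = LinearSEBlock(channels=512)
    features = torch.randn(2, 512, 7, 7)
    print(block(features).shape)
```

Keeping it as its own class would leave the existing Conv-based module untouched, so either variant could be selected when building the backbone.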

Zhantao Yang · Answer 7 · Sat Jul 30 2022 04:15:24 GMT+0800 (China Standard Time)

I'm closing this issue because this is a known cuDNN issue that has been fixed in the nightly build with the newest cuDNN. It will no longer be a problem once the nightly changes reach a stable release, since Conv and Linear will then use the same internal kernel.

Edit: for reference, I asked about this on the PyTorch forum: https://discuss.pytorch.org/t/squeezeexcitation-module-design-choice/157646