Squeeze and Excitation Conv vs Linear
seermer opened this issue · comments
the current implementation of SE module in this repo uses Conv2d.
However, I noticed that the equivalent Linear layer is actually significantly faster than Conv2d for forward call (which means faster inference), maybe consider changing it? below is the code I used for timing:
import torch
import torch.nn as nn
import time
from torch.backends import cudnn
cudnn.benchmark = True
@torch.no_grad()
def main():
data = torch.randn(64, 512,1,1).float().to("cuda")
shape = data.shape
m1 = nn.Sequential(
nn.Linear(512, 64),
nn.ReLU(True),
nn.Linear(64, 512),
nn.Sigmoid()
).float().to("cuda")
m2 = nn.Sequential(
nn.Conv2d(512, 64, 1),
nn.ReLU(True),
nn.Conv2d(64, 512, 1),
nn.Sigmoid()
).float().to("cuda")
# make sure weights and bias are the same
m1[0].weight.data = m2[0].weight.data.squeeze()
m1[0].bias.data = m2[0].bias.data
m1[2].weight.data = m2[2].weight.data.squeeze()
m1[2].bias.data = m2[2].bias.data
res1 = data.squeeze()
res1 = m1(res1)
res1 = res1.reshape(shape)
res2 = m2(data)
print(torch.abs(res1 - res2).sum()) # check if they are equivalent
for _ in range(2): # warmup
data = data.squeeze()
data = m1(data)
data = data.reshape(shape)
torch.cuda.synchronize()
t = time.perf_counter()
for _ in range(10000):
data = data.squeeze()
data = m1(data)
data = data.reshape(shape)
torch.cuda.synchronize()
print(time.perf_counter() - t)
for _ in range(2): # warmup
data = m2(data)
torch.cuda.synchronize()
t = time.perf_counter()
for _ in range(10000):
data = m2(data)
torch.cuda.synchronize()
print(time.perf_counter() - t)
if __name__ == "__main__":
main()
output:
tensor(0.0007, device='cuda:0') # total absolute difference, small float calculation difference
2.3467748999828473
3.0033348000142723
@seermer hi, thanks, can you train a resnet backbone with linear SE and conv se respectively, without too many epochs, compare their results, and then give a PR. build together.
@Bobo-y sorry, but I don't think I have any resources to train the model right now, but I am pretty sure using Linear layer is correct, the original paper (SENet paper) uses two fully connected layers (aka Linear layer), see screenshot below.
But I can still do a PR to add the LinearSEBlock module if you want.
@seermer yes, i know that, the original se use linear, but some implementations use conv, I verified resnet18 with cbam(conv se module), can Improve mAP. I did not test se with linear. of course, there is no problem replacing it with linear.
@Bobo-y I don't think Linear SE will improve mAP at all, because it is strictly equivalent to Conv SE, even if there is any difference, it is just the result of randomization and/or initialization
@Bobo-y I don't think Linear SE will improve mAP at all, because it is strictly equivalent to Conv SE, even if there is any difference, it is just the result of randomization and/or initialization
I mean: add a linear SE can imporve mAP compare with no SE backbone. Cmopare with conv SE, I don't know who is better.
@Bobo-y How about I will just do a PR that have linear version as a separate class
I'm closing this issue because this is a known issue of cuDNN and has been fixed in the nightly version with the newest cuDNN, this will no longer be a problem in the near future once nightly gets stable since they will use the same internal kernel for Conv and Linear.
edit: reference: I asked through torch forum https://discuss.pytorch.org/t/squeezeexcitation-module-design-choice/157646