sail-sg / volo

VOLO: Vision Outlooker for Visual Recognition

Compare to DynamicConv

toodle opened this issue · comments

Hi,

Thanks for your work.

What's the main difference between VOLO and DynamicConv?

Though convolution is not explicitly used, convolution is equivalent to Unfold + matrix multiplication + Fold (or a view to the output shape).

An example is provided here: https://pytorch.org/docs/stable/generated/torch.nn.Unfold.html
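As a quick check of that equivalence, here is a minimal PyTorch sketch (shapes chosen purely for illustration) showing that Unfold + matrix multiplication + a view to the output shape reproduces `F.conv2d`:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)        # (N, C_in, H, W)
w = torch.randn(4, 3, 3, 3)        # (C_out, C_in, kH, kW)

# Reference: ordinary convolution (stride 1, no padding).
ref = F.conv2d(x, w)               # (1, 4, 6, 6)

# Unfold -> matrix multiplication -> view to the output shape.
cols = F.unfold(x, kernel_size=3)  # (1, C_in*kH*kW, L) = (1, 27, 36)
out = w.view(4, -1) @ cols         # (1, 4, 36) via broadcast matmul
out = out.view(1, 4, 6, 6)

assert torch.allclose(ref, out, atol=1e-4)
```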

If there is no clear difference, I personally think the claim in the abstract that "attention-based models are indeed able to outperform CNNs" is not accurate. VOLO looks more like a hybrid model based on attention and (strengthened) convolution.

Thanks for your question. The difference is clear. The outlooker in VOLO is a new attention mechanism that targets encoding fine-level token representations. We use a linear layer to generate the attention weights, which are then used for value projection. See the pseudocode in the paper.
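For readers without the paper at hand, a rough single-head sketch of outlook attention (my own simplified reading, not the authors' exact implementation; the class name, kernel size, and stride-1 setup are assumptions) might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleOutlookAttention(nn.Module):
    """Simplified single-head outlook attention (kernel K, stride 1)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.v = nn.Linear(dim, dim)
        # A plain linear layer emits the (K*K) x (K*K) attention map per pixel.
        self.attn = nn.Linear(dim, kernel_size ** 4)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, H, W, C)
        B, H, W, C = x.shape
        k = self.k
        v = self.v(x).permute(0, 3, 1, 2)         # (B, C, H, W)
        # Unfold values into K*K neighbourhoods around every pixel.
        v = F.unfold(v, k, padding=k // 2)        # (B, C*k*k, H*W)
        v = v.view(B, C, k * k, H * W).permute(0, 3, 2, 1)  # (B, HW, k*k, C)
        # Attention weights come directly from a linear projection of x,
        # with no query-key dot product.
        a = self.attn(x).view(B, H * W, k * k, k * k).softmax(dim=-1)
        out = a @ v                               # (B, HW, k*k, C)
        # Fold the weighted neighbourhoods back onto the feature map.
        out = out.permute(0, 3, 2, 1).reshape(B, C * k * k, H * W)
        out = F.fold(out, (H, W), k, padding=k // 2)  # (B, C, H, W)
        return self.proj(out.permute(0, 2, 3, 1))     # (B, H, W, C)
```

The point of contention below is exactly the `self.attn` line: the kernel-like weights are produced by a linear projection rather than learned as static convolution filters.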

Thanks for your reply but I don't think the difference is clear.

  1. The outlooker in VOLO is a new attention mechanism that targets encoding fine-level token representations.

So can dynamic convolution.

  1. We use a linear layer to generate the attention weights

The dynamic convolution weights can also be generated by a linear layer (e.g., in Sparse R-CNN).
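To make that comparison concrete, here is a minimal sketch of generating dynamic convolution weights with a linear layer, in the spirit of Sparse R-CNN's dynamic instance interaction (the shapes and variable names are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

C = 16
feat = torch.randn(49, C)      # 7x7 RoI features, flattened to 49 tokens
query = torch.randn(C)         # one proposal/query feature

# A linear layer emits a C x C weight, used as a dynamic 1x1 convolution.
gen = nn.Linear(C, C * C)
w = gen(query).view(C, C)      # dynamic weight conditioned on the query

out = feat @ w.t()             # apply it as a 1x1 conv over the RoI tokens
print(out.shape)               # torch.Size([49, 16])
```

So "attention weights from a linear layer" is not, by itself, what separates the outlooker from dynamic convolution.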

I think the main difference between the outlooker and dynamic convolution is how the dynamic kernel is generated.

For dynamic convolution, methods like CARAFE/Involution generate one 5x5 kernel via a 1x1 convolution at the center point.

For the outlooker, one 5x5 kernel is generated by combining/folding nine 3x3 kernels, where the nine kernels are generated at nine points around the center via different 1x1 convolutions.
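At the shape level, the two generation schemes can be contrasted as follows (a schematic sketch only; the generator names and the minimal setup are my assumptions, not either paper's code):

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 16, 8, 8
x = torch.randn(B, C, H, W)

# CARAFE/Involution style: a single 1x1 convolution at each center point
# emits one 5x5 kernel for that location.
gen_one = nn.Conv2d(C, 5 * 5, kernel_size=1)
kernels_one = gen_one(x)        # (B, 25, H, W): one 5x5 kernel per center

# Outlooker style: each of the nine points around the center contributes
# its own 3x3 kernel (9 kernels x 9 weights = 81 values per location);
# folding the nine 3x3 kernels yields an effective 5x5 kernel.
gen_nine = nn.Conv2d(C, 9 * 3 * 3, kernel_size=1)
kernels_nine = gen_nine(x)      # (B, 81, H, W): nine 3x3 kernels per center
```

Both paths are cheap 1x1 projections; the difference is whether the kernel for a center is predicted at one point or assembled from its neighbourhood.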

Since there is no ablation study on filter generation in this paper (it only compares to convolution and LSA with a 3x3 kernel, not to CARAFE/Involution with a 5x5 kernel), it is not clear which one is better.