SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022

This seems to be an idea that has already been demonstrated by an existing method.

lartpang opened this issue · comments

First of all, this is really very interesting work.

This work uses a strategy very similar to one in a paper I've read: Stand-Alone Self-Attention in Vision Models.

However, I did not find a relevant comparison in the paper; the authors should probably add some discussion explaining the difference.

Hello and thank you for your interest.

Thank you for pointing this out. We actually cited a more recent follow-up work to SASA by the same group of authors, HaloNet.
As stated in the paper, localizing attention is not a new idea, just as attention itself is not a new idea. Swin also localizes attention (as other works do), but the difference lies in the choice of receptive fields.
A key difference between NA and SASA is in the definition of neighborhoods.
NA is based on the concept of each pixel attending to its nearest neighbors, while SASA, following "same" convolutions, is based on the concept of each pixel attending only to its surrounding pixels. The two are very different at the edges, and the affected edge regions grow with window size (see Fig. 6 in our paper, or the animation in the README, for an illustration of how NA handles edges and corners; a small sketch of the index arithmetic follows below).
In addition, neither SASA nor HaloNet was open-sourced, which makes a direct comparison difficult; HaloNet's Table 1 also seems to suggest that SASA has a different computational complexity and memory usage than NA, so there may be other differences that we are not aware of.
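
To make the two neighborhood definitions concrete, here is a minimal 1-D sketch of the index arithmetic (an illustration only, not the actual CUDA kernel in this repo; the helper names `na_neighborhood` and `sasa_neighborhood` are made up for this example):

```python
# Minimal 1-D illustration of the two neighborhood definitions.
# This is a sketch for explanation only, not the repo's CUDA kernel.

def na_neighborhood(i, length, k):
    """Neighborhood Attention: pixel i attends to its k nearest neighbors.
    Near the edges the window is clamped so it stays inside the feature map,
    which means every pixel always attends to exactly k positions."""
    start = max(0, min(i - k // 2, length - k))
    return list(range(start, start + k))

def sasa_neighborhood(i, length, k):
    """SASA-style ('same'-convolution) window: the window stays centered on
    pixel i, and out-of-bounds positions are masked/zero-padded, so edge
    pixels attend to fewer than k valid positions."""
    return [j for j in range(i - k // 2, i + k // 2 + 1) if 0 <= j < length]

if __name__ == "__main__":
    length, k = 8, 5
    for i in range(length):
        print(i, na_neighborhood(i, length, k), sasa_neighborhood(i, length, k))
```

For interior pixels the two windows coincide; they only differ near the boundaries, which is exactly where the gap grows with window size.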

Another big difference between the papers is the application of NA vs SASA.
SASA aimed to replace spatial convolutions in existing models, typically with small kernel sizes, in architectures like ResNets. Our idea is to use large-neighborhood NA to build efficient hierarchical transformers that work well for both image classification and downstream vision applications, similar to what the Swin Transformer does, but simpler and more efficient. This is why our Related Work section did not focus on works such as SASA and HaloNet: while there are similarities in their attention mechanisms, the focus and application of the papers are very different. Our NAT directly competes with existing state-of-the-art hierarchical models such as Swin.

I hope that answers your question. Feel free to reopen the issue if you have any more questions.

Okay, thanks for your reply!

Thanks for the explanations. Is it possible to also provide a runtime comparison between NAT and Swin? It seems the current paper only compares FLOPs, which are not always consistent with runtime.

@xuxy09 That is true, FLOPs are not a direct measure of time. They are, however, a measure of computational cost, and we are particularly interested in that because the kernel is still not as fast as it can potentially be. As far as runtime goes, both training and inference on classification run at the same throughput as Swin at the Tiny scale, but the two grow apart at larger scales, with NAT being slower than Swin. Again, that is only a limitation of the existing implementation, which we expect will change in the near future. You can also refer to issue #13 for details.
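
For anyone who wants to check runtime themselves, a rough inference-throughput measurement could look like the sketch below (an illustration only, not code from this repo; `build_model` is a hypothetical placeholder for however you construct NAT-Tiny or Swin-Tiny, e.g. via timm or the official repos):

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=64, img_size=224, warmup=10, iters=50):
    """Return images/second for a forward-only (inference) pass on GPU."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(warmup):           # warm up CUDA kernels / autotuning
        model(x)
    torch.cuda.synchronize()          # wait for all queued GPU work
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)

# Hypothetical usage -- replace build_model with a real constructor:
# print(measure_throughput(build_model("nat_tiny")))
# print(measure_throughput(build_model("swin_tiny")))
```

Throughput measured this way depends heavily on batch size, resolution, and hardware, so it complements rather than replaces the FLOP counts in the paper.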

First of all, thank you very much for your contribution to the community.
I also have some questions about how it differs from previous works.

A similar approach seems to be mentioned in the Swin Transformer paper, and Swin's repo contains an implementation of this variant (a sliding window). NAT's results for this part of the ablation seem to be consistent with Swin's; could the quantitative improvements therefore be caused in part by the narrower but deeper network?

I'm very much looking forward to your new CUDA implementation, as I tried a similar idea but gave up because of the speed and memory overhead.

@IDKiro Based on my reading, the sliding window approach seems to be more similar to SASA than to NA. We also observed that NAT-T is just as fast as Swin-T in inference on ImageNet, while their sliding window approach seems to be much slower. The Swin-T-based result on ImageNet happens to be the same, 81.4%, but this is likely coincidental: as you can see in our ablation table, this gap grows as we shift to our NAT configuration (~0.5%). I'd also point out that our segmentation result with that model (not in the paper) was 46.3 mIoU, while the number in Swin's table is 45.8. We also did a detection run with that model, but using Mask R-CNN (the table from Swin uses Cascade Mask R-CNN), and observed that it performed on par with Swin (46.1 mAP vs. Swin-T's 46.0), while the sliding window approach seems to do worse than Swin-T.