SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022



About the neighborhood size

wangning7149 opened this issue · comments

Hi,
For a neighborhood of size L × L, is L here equal to 3?

According to the paper, the overall setup follows Swin: Swin uses a window size of L = 7, and NAT uses the same.
https://github.com/SHI-Labs/Neighborhood-Attention-Transformer/blob/main/classification/nat.py#L259

Hello and thank you for your interest.

Firstly, L × L is the term we use in the paper to denote kernel (window) size. The neighborhood size would technically be half the window size: in theory, each query has L // 2 neighbors on each side of it along each axis, so L // 2 * 2 neighbors plus the query itself yields L total pixels along each axis. That is also why we force the kernel size to be an odd number, so that query pixels can be centered.
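
To make the arithmetic concrete, here is a small Python snippet (not from the repo; `neighborhood_span` is just an illustrative helper) showing why an odd kernel size L gives L // 2 neighbors per side plus the query:

```python
# Illustration only: for an odd kernel size L, each query has L // 2 neighbors
# on each side of it along an axis, plus itself.
def neighborhood_span(kernel_size: int) -> int:
    assert kernel_size % 2 == 1, "kernel size must be odd so the query can be centered"
    per_side = kernel_size // 2
    return per_side * 2 + 1  # neighbors on both sides plus the query pixel itself

assert neighborhood_span(7) == 7  # NAT default: 3 neighbors per side + the query
assert neighborhood_span(3) == 3  # the case asked about above: 1 neighbor per side
```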

We followed Swin in setting the window size to 7×7 so that both end up with receptive fields of the same size. In other words, in every attention module, both NA and SWSA limit each query to exactly 7×7 keys and values.
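
For illustration only, here is a naive PyTorch sketch of what a single query's neighborhood attention could look like under simplifying assumptions (no projections or heads, keys reused as values, window shifted at the borders so each query still sees exactly 7×7 pixels); the repo's actual implementation is far more efficient and is not this code:

```python
import torch

def single_query_na(x: torch.Tensor, i: int, j: int, kernel_size: int = 7) -> torch.Tensor:
    """Naive neighborhood attention for the query at (i, j) on a (H, W, C) feature map."""
    H, W, C = x.shape
    half = kernel_size // 2
    # Shift the window near borders so the query still sees kernel_size**2 pixels.
    top = min(max(i - half, 0), H - kernel_size)
    left = min(max(j - half, 0), W - kernel_size)
    window = x[top:top + kernel_size, left:left + kernel_size]  # (k, k, C)
    q = x[i, j]                                                  # (C,)
    k = window.reshape(-1, C)                                    # (k*k, C)
    attn = torch.softmax(q @ k.T / C ** 0.5, dim=-1)             # (k*k,) attention weights
    return attn @ k                                              # (C,) weighted sum over the neighborhood

x = torch.randn(14, 14, 64)
out = single_query_na(x, i=5, j=5)  # attends to exactly 7x7 = 49 keys/values
```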

As for the models, we used a configuration different from Swin's. We first found overlapping convolutions to be more effective than patched (non-overlapping) convolutions for both tokenization and downsampling. We also found that slightly deeper models, with thinner inverted bottlenecks, achieve even better performance.
That is why our final models end up with fewer FLOPs than their Swin counterparts.
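
As a rough sketch of that difference (assumed layer sizes, not the repo's exact code; the real tokenizer is in classification/nat.py), a patched embedding versus an overlapping convolutional tokenizer might look like:

```python
import torch
from torch import nn

# Swin-style patched embedding: non-overlapping 4x4 patches in a single stride-4 conv.
patched_tokenizer = nn.Conv2d(3, 96, kernel_size=4, stride=4)

# Overlapping tokenizer: two 3x3 stride-2 convolutions reach the same 4x downsampling
# but with overlapping receptive fields.
overlapping_tokenizer = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(patched_tokenizer(x).shape)      # torch.Size([1, 96, 56, 56])
print(overlapping_tokenizer(x).shape)  # torch.Size([1, 96, 56, 56])
```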

We've done an ablation study on these changes, which is presented in the paper.

I hope this answers both of your questions.

Closing this due to inactivity. If you still have questions feel free to open it back up.