SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022



About the neighborhood size

wangning7149 opened this issue · comments

Hi,
For a neighborhood of size L × L, is L here equal to 3?

According to the paper, the overall setup follows Swin: Swin uses a window size of L = 7, and NAT uses the same.
https://github.com/SHI-Labs/Neighborhood-Attention-Transformer/blob/main/classification/nat.py#L259

Hello and thank you for your interest.

Firstly, L × L is the term we use in the paper to denote kernel (window) size. The neighborhood size would technically be half the window size: in theory, each query has L // 2 neighbors on each side of it along each axis, so L // 2 * 2 neighbors plus the query itself yields L total pixels along each axis. That is also why we force the kernel size to be an odd number, so that query pixels can be centered.
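
To make the arithmetic concrete, here is a small Python snippet (not from the repo; `neighborhood_span` is just an illustrative helper) showing why an odd kernel size L gives L // 2 neighbors per side plus the query:

```python
# Illustration only: for an odd kernel size L, each query has L // 2 neighbors
# on each side of it along an axis, plus itself.
def neighborhood_span(kernel_size: int) -> int:
    assert kernel_size % 2 == 1, "kernel size must be odd so the query can be centered"
    per_side = kernel_size // 2
    return per_side * 2 + 1  # neighbors on both sides plus the query pixel itself

assert neighborhood_span(7) == 7  # NAT default: 3 neighbors per side + the query
assert neighborhood_span(3) == 3  # the case asked about above: 1 neighbor per side
```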

We followed Swin in setting the window size to 7×7 so that both end up with receptive fields of the same size. In other words, in every attention module, both NA and SWSA limit each query to exactly 7×7 keys and values.
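
For illustration only, here is a naive PyTorch sketch of what a single query's neighborhood attention could look like under simplifying assumptions (no projections or heads, keys reused as values, window shifted at the borders so each query still sees exactly 7×7 pixels); the repo's actual implementation is far more efficient and is not this code:

```python
import torch

def single_query_na(x: torch.Tensor, i: int, j: int, kernel_size: int = 7) -> torch.Tensor:
    """Naive neighborhood attention for the query at (i, j) on a (H, W, C) feature map."""
    H, W, C = x.shape
    half = kernel_size // 2
    # Shift the window near borders so the query still sees kernel_size**2 pixels.
    top = min(max(i - half, 0), H - kernel_size)
    left = min(max(j - half, 0), W - kernel_size)
    window = x[top:top + kernel_size, left:left + kernel_size]  # (k, k, C)
    q = x[i, j]                                                  # (C,)
    k = window.reshape(-1, C)                                    # (k*k, C)
    attn = torch.softmax(q @ k.T / C ** 0.5, dim=-1)             # (k*k,) attention weights
    return attn @ k                                              # (C,) weighted sum over the neighborhood

x = torch.randn(14, 14, 64)
out = single_query_na(x, i=5, j=5)  # attends to exactly 7x7 = 49 keys/values
```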

As for the models, we used a configuration different from Swin's. We first found overlapping convolutions to be more effective than patched (non-overlapping) convolutions for both tokenization and downsampling. We also found that slightly deeper models, with thinner inverted bottlenecks, achieve even better performance.
That is why our final models end up with fewer FLOPs than their Swin counterparts.
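
As a rough sketch of that difference (assumed layer sizes, not the repo's exact code; the real tokenizer is in classification/nat.py), a patched embedding versus an overlapping convolutional tokenizer might look like:

```python
import torch
from torch import nn

# Swin-style patched embedding: non-overlapping 4x4 patches in a single stride-4 conv.
patched_tokenizer = nn.Conv2d(3, 96, kernel_size=4, stride=4)

# Overlapping tokenizer: two 3x3 stride-2 convolutions reach the same 4x downsampling
# but with overlapping receptive fields.
overlapping_tokenizer = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(patched_tokenizer(x).shape)      # torch.Size([1, 96, 56, 56])
print(overlapping_tokenizer(x).shape)  # torch.Size([1, 96, 56, 56])
```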

We've done an ablation study on these changes, which is presented in the paper.

I hope this answers both of your questions.

Closing this due to inactivity. If you still have questions feel free to open it back up.