SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022.


Comparison with zero-padding version.

weigq opened this issue · comments

commented

Excellent work!
BTW, the paper claims that the proposed edge/corner neighborhood selection performs better than the zero-padding version. I am curious about the performance of the latter, which is not reported in the paper?

I am also interested in this claim, but did not find an ablation study on it.

Hello, and thank you for your interest.

Generally, we observed on-par or worse performance when using zero padding, and the gap widened as we scaled up or moved toward downstream tasks.
I should also note that with zero padding, the module would no longer be as expressive as Swin's shifted window attention (SWA), because of the reduced receptive field at border pixels. Additionally, with zero padding, the attention mechanism would not become equivalent to self-attention when the neighborhood size matches the feature map size.
In other words, zero padding is simply less expressive, and at best saves a negligible amount of compute, even with the CUDA kernel.
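To make the difference concrete, here is a minimal 1D sketch (not the library's actual implementation, and the function names are hypothetical) contrasting the two border-handling strategies. With edge/corner selection, the window is shifted so every query still attends to a full set of real tokens; with zero padding, border queries keep a centered window but some slots fall outside the input and carry no information.

```python
import numpy as np

def neighborhood_indices_clamped(length, kernel_size):
    """Edge/corner-style selection: shift (clamp) the window near the borders
    so every query attends to exactly `kernel_size` valid positions."""
    radius = kernel_size // 2
    idx = []
    for i in range(length):
        start = min(max(i - radius, 0), length - kernel_size)
        idx.append(list(range(start, start + kernel_size)))
    return np.array(idx)

def neighborhood_indices_zero_pad(length, kernel_size):
    """Zero-padding-style selection: keep the window centered on the query and
    mark out-of-bounds slots with -1 (to be zero-padded / masked), so border
    queries effectively attend to fewer real tokens."""
    radius = kernel_size // 2
    idx = []
    for i in range(length):
        row = [j if 0 <= j < length else -1
               for j in range(i - radius, i + radius + 1)]
        idx.append(row)
    return np.array(idx)

print(neighborhood_indices_clamped(6, 3))
# Query 0 attends to [0, 1, 2]: the window slides inward, receptive field stays full.
print(neighborhood_indices_zero_pad(6, 3))
# Query 0 attends to [-1, 0, 1]: one slot is padding, shrinking its effective receptive field.
```

Note that in the clamped version, setting `kernel_size == length` makes every query attend to every position, i.e. it reduces to self-attention, whereas in the zero-padded version border queries would still waste slots on padding.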

We may add our findings regarding the zero-padding version to our supplementary materials in a future revision.

I hope this helps.

commented

Thanks