SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022.

How to visualize the attention map?

Amo5 opened this issue

Hi,
I used the command `pip3 install natten -f https://shi-labs.com/natten/wheels/cu116/torch1.12.1/index.html` to install the wheel.
But I don't know how to visualize the attention map of NeighborhoodAttention2D.
Could you help me?

Hello and thank you for your interest in our work.
First off, I'm sorry for getting to this question so late.

Unfortunately, methods that restrict self attention to small windows cannot produce attention maps in the same way that full self attention does. This applies both to sliding-window approaches (NA/DiNA, SASA, Sliding Window Attention) and to partitioning-based methods (block attention, WSA, and the like).

There are two reasons for that, the most important of which is that the full self attention graph is never learned during training. Every pixel attends to a subset of the input rather than the entire set, so every pixel only produces a fixed number of attention weights. In other words, given a 64x64 input feature map and a 7x7 kernel, you end up with attention maps of shape 7x7 for every pixel, whereas with full self attention you would have attention maps of shape 64x64 for every pixel (still hard to visualize, because there are 4096 pixels and therefore 4096 x 64 x 64 attention weights in total, but it is easy to either reduce those to a single attention map, or cross-attend something with every pixel to produce one attention map).
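
To make that concrete, here is a minimal sketch in plain PyTorch (not the NATTEN CUDA kernels) of how per-pixel k x k attention weights could be pulled out of query/key tensors. It uses a zero-padded `unfold`, so border behavior differs from NA, which shifts the window instead of padding; the function name and tensor layout are illustrative, not part of the library.

```python
import torch
import torch.nn.functional as F

def neighborhood_attention_weights(q, k, kernel_size=7):
    # q, k: [B, heads, H, W, dim]; returns [B, heads, H, W, kernel_size * kernel_size].
    B, heads, H, W, dim = q.shape
    pad = kernel_size // 2
    # Gather the k x k neighborhood of keys around every pixel (zero-padded at the borders).
    k_ = k.permute(0, 1, 4, 2, 3).reshape(B * heads, dim, H, W)
    k_unf = F.unfold(k_, kernel_size, padding=pad)                       # [B*heads, dim*k*k, H*W]
    k_unf = k_unf.reshape(B, heads, dim, kernel_size * kernel_size, H, W)
    k_unf = k_unf.permute(0, 1, 4, 5, 3, 2)                              # [B, heads, H, W, k*k, dim]
    # Dot product of every query with its own neighborhood of keys, then softmax over the window.
    attn = torch.einsum("bhxyd,bhxynd->bhxyn", q * dim ** -0.5, k_unf)
    return attn.softmax(dim=-1)

# Example: a 64x64 feature map with kernel size 7 gives 49 weights per pixel, not 64x64.
q = torch.randn(1, 4, 64, 64, 32)
k = torch.randn(1, 4, 64, 64, 32)
w = neighborhood_attention_weights(q, k)   # [1, 4, 64, 64, 49]
```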

Methods that do not restrict attention (ViT / DeiT) typically also learn a "class token" and use it to produce attention maps at different layers -- the class token attends to every pixel in your feature map (and to itself, depending on the model), so given any input image, that token can cross-attend the pixels in the same way (this is the "something" I mentioned earlier).
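
For contrast, a small sketch of that class-token trick for unrestricted attention (assuming a ViT-style, already-softmaxed attention matrix of shape [B, heads, 1 + H*W, 1 + H*W] with the class token at index 0; names are illustrative):

```python
import torch

def cls_attention_map(attn, grid_size):
    # attn: softmaxed attention of shape [B, heads, 1 + H*W, 1 + H*W], class token at index 0.
    # Returns one H x W map per head: [B, heads, H, W].
    H, W = grid_size
    cls_to_patches = attn[:, :, 0, 1:]    # attention paid by the class token to every patch
    return cls_to_patches.reshape(attn.shape[0], attn.shape[1], H, W)

# Example: a 14x14 patch grid (224x224 image, 16x16 patches), 12 heads.
attn = torch.softmax(torch.randn(1, 12, 1 + 14 * 14, 1 + 14 * 14), dim=-1)
maps = cls_attention_map(attn, (14, 14))  # [1, 12, 14, 14]
```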

I'm sorry, I meant to get to this sooner. I have code to visualize the attention maps for both Swin and NAT located here.

If you use these attention maps in your work, please cite StyleNAT, as that is where they were introduced.

Here is a sample of what the maps may look like. Note that in StyleNAT we use Hydra-NA, which allows different dilations and/or kernel sizes on each attention head, so we look at the heads independently. You can either mean or sum the heads if you want (a small sketch of that is below). Also note that this example is from a generative model, so your maps would look different in a discriminative network.
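
As a rough sketch of that head aggregation (assuming per-head maps shaped [heads, H, W]; names are illustrative):

```python
import torch

per_head = torch.rand(4, 64, 64)      # e.g. one 64x64 map per attention head
mean_map = per_head.mean(dim=0)       # average across the heads ...
sum_map = per_head.sum(dim=0)         # ... or simply sum them
```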

There are a lot more samples in the appendix of StyleNAT, including ones from Swin. There will be some visual differences between these, because the attention mechanisms have different types of biases; we discuss this extensively in StyleNAT as well.

[image: sample per-head attention map visualizations]