VITA-Group / ViT-Anti-Oversmoothing

[ICLR 2022] "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice" by Peihao Wang, Wenqing Zheng, Tianlong Chen, Zhangyang Wang

Question about Figure 5

techmonsterwang opened this issue · comments

Hi, Peihao!

This is an amazing paper with fantastic visualization results.

I have two questions about the visualization of the spectrum of attention maps (Figure 5):

  1. Is Figure 5 plotted from the pre-trained DeiT-S? I made some plots with the visualization code you provided and the publicly released DeiT pre-trained weights, but they don't seem to match Figure 5.
  2. Is the matrix used for Figure 5 the attention map after softmax, or is it taken from another location?

Thanks for replying!

commented

Dear Jiahao,

Thanks for your interest.

  1. Yes, you should be able to get a similar visualization of the spectrum using the pre-trained DeiT-S if it is computed correctly.
  2. Figure 5 plots the post-softmax attention matrix.

Best,
Peihao

Hi, Peihao!

Thanks for the nice reply.

I have plotted the post-softmax attention matrix using the open-source DeiT-S pretrained model and obtained the results. However, I found that the result at layer 5 looks like this:

[attached image: layer-5 spectrum plot]

I also found that the attention matrices after layer 2 show a similar pattern, which does not match Figure 5.

Is it possible that I went wrong somewhere? Could you provide the full code for this part, please? Thanks a lot.

Besides, I extract the attention matrix with the following code:

import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        # project to q, k, v: each (B, num_heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)

        # post-softmax attention map of shape (B, num_heads, N, N)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        # also return the attention map for visualization
        return x, attn
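
For the spectrum itself, my plotting step is roughly the sketch below. This is only my own guess at the procedure, not your released code: I take one layer's post-softmax attention, average the heads, apply an FFT along the token dimension, and plot the mean magnitude; the function name plot_attn_spectrum and all these choices are my assumptions.

# Rough sketch of how I turn the returned attn into a spectrum plot
# (my own guess, not the authors' code).
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_attn_spectrum(attn):
    # attn: (B, num_heads, N, N) post-softmax attention of one layer
    a = attn[0].mean(dim=0)                       # first image, heads averaged -> (N, N)
    spec = torch.fft.fft(a, dim=-1).abs()         # magnitude spectrum of each row
    curve = spec.mean(dim=0)                      # average over rows -> (N,)
    plt.plot(curve[: curve.shape[0] // 2].cpu())  # keep non-negative frequencies only
    plt.xlabel("frequency index (0 = DC)")
    plt.ylabel("magnitude")
    plt.show()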

Am I doing the same as you? Thanks for the reply~

commented

Hi Jiahao,

Thanks for the follow-up. I guess we are using the same code for computing attention. I do think the figure you showed above has the overall shape we expect - a peak around the DC component and a long tail across the high-frequency bands. However, I'm not convinced that the frequencies other than DC have zero response. This is not possible; otherwise the attention map would be a (normalized) all-one matrix. I would suggest you first visualize the attention map directly, and try different batches of images.
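
For example, a direct check could be as simple as the following (just a generic matplotlib call, not code from our repo; attn here is the post-softmax map returned by your modified forward):

import matplotlib.pyplot as plt

# attn: (B, num_heads, N, N) returned by the modified Attention.forward above
plt.imshow(attn[0, 0].detach().cpu(), cmap="viridis")  # first image, first head
plt.colorbar()
plt.show()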

Best,
Peihao

Hi Peihao,

Thanks for the quick and nice reply!
I saw that the paper says "Below we provide a complete spectral visualization of attention maps computed from a random sample in ImageNet validation set", which means that Figure 5 was drawn from only one sample.

Or do you mean that Figure 5 is obtained from a batch of validation data? If so, how large is the batch?

Hi Peihao,

Yes, as you say, the attention matrix after layer 2 indeed turns out to be a (normalized) all-one matrix like:

tensor([[0.0051, 0.0051, 0.0051, ..., 0.0051, 0.0051, 0.0051],
[0.0051, 0.0051, 0.0051, ..., 0.0051, 0.0051, 0.0051],
[0.0051, 0.0051, 0.0051, ..., 0.0051, 0.0051, 0.0051],
...,
[0.0051, 0.0051, 0.0051, ..., 0.0051, 0.0051, 0.0051],
[0.0051, 0.0051, 0.0051, ..., 0.0051, 0.0051, 0.0051],
[0.0051, 0.0051, 0.0051, ..., 0.0051, 0.0051, 0.0051]],
grad_fn=)

I wonder why it turns out like this, since I directly use the pretrained open-source DeiT-S checkpoint.

commented

Hi Jiahao,

Thanks for your continued interest, and sorry for the late reply. Getting uniformly distributed attention maps does not make sense to me. Have you checked the test accuracy?
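
A quick way to sanity-check the checkpoint could be something like the sketch below (the torch.hub entry point is the one from the official DeiT README, as far as I remember; adapt it to however you load the weights):

import torch

# load the official pre-trained DeiT-S and check that it behaves sensibly
model = torch.hub.load('facebookresearch/deit:main',
                       'deit_small_patch16_224', pretrained=True)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # should be (1, 1000); on real images the predicted class should vary per image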

Fig. 5 is visualized for one sample (i.e., one image), not a batch. As far as I remember, it was not cherry-picked; an arbitrary sample produces similar results.

Peihao

Hello, I really love your paper. Both the results and the visualization are incredible. I am very curious about how you visualized Figure 5; could I have your code for it?

commented

Hi, thanks for your interest. The code to visualize the spectrum of attention can be found here: #1 (comment). Hope you find this helpful!

commented

Feel free to reopen this thread if there are any further issues.