hehefan / P4Transformer

Implementation of the "Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos" paper.

visualize transformer's attention

weiyutao886 opened this issue · comments

I want to visualize the transformer's attention. I see that Fig. 4 in your paper visualizes it. Can you tell me where and how to do this? Can you share the visualization code? Thank you.

Hi,

The visualization in the paper is generated by Mayavi. You can use it to visualize the self-attention map attn at https://github.com/hehefan/P4Transformer/blob/main/modules-pytorch-1.8.1/transformer.py#L60.

Best regards.

Thank you for your reply. Do you save the attn data during training and then visualize it, or do you visualize it directly? Also, I found https://docs.enthought.com/mayavi/mayavi/auto/mlab_helper_functions.html#points3d — is this the part of Mayavi you used? Since the data is a sequence, do I need to process attn to separate out a single frame before visualization? I don't know much about visualization, so I'm sorry to bother you again.

Hi,

Apologies for my late reply.

I saved the attn data during the evaluation. I used the points3d function of Mayavi to visualize each frame. Also, note that you need to save the position of each query area.
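For each frame, a call along these lines works (a minimal sketch; the file names and scale settings are illustrative, not the exact script behind the paper figures):

    import numpy as np
    from mayavi import mlab

    xyz = np.loadtxt('frame_points.txt')      # (N, 3) point coordinates of one frame (illustrative file)
    w = np.loadtxt('frame_attention.txt')     # (N,) attention weight per point (illustrative file)

    mlab.figure(bgcolor=(1, 1, 1))
    mlab.points3d(xyz[:, 0], xyz[:, 1], xyz[:, 2], w,
                  scale_mode='none', scale_factor=0.02, colormap='jet')
    mlab.show()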

Best.

I saw that the dimension of attn is [batch_size, heads, c*l, c*l], where batch_size is the batch, heads is the number of transformer heads, and c*l is the product of the number of frames and the number of points, as you also mention in the paper. But attn contains no point positions. How can I visualize the shape of the person in your paper from attn alone? In other words, visualizing attn only visualizes its weights, and without point positions it cannot be placed in space. I'm sorry to trouble you again.

Hi,

Point features and attention weights are associated with point positions/coordinates in point-based methods. You can associate them manually by modifying the code.

In [batch_size, heads, c1, c2], c1 indexes the queries and c2 indexes the attention weights over the keys.
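For example (the indices here are illustrative):

    b, h, q_idx = 0, 0, 0          # sample, head and query position to inspect (illustrative)
    weights = attn[b, h, q_idx]    # how this query attends to every (frame, point) position, length L*n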

Best.

So you mean that c1 indexes the query points and c2 gives the corresponding weights for each query, so I can take out the c2 weights and assign them to the point cloud coordinates to visualize them, right? I do this at present, but there are only 128 points in the point cloud of each frame, so the visualization does not look good. Do you have any suggestions?

I read your code. The data you input into the transformer is [14, 12, 1024, 64], that is, 12 frames with 64 points in each frame, so attn covers 12 * 64 points. In this way there are only 64 points per frame: if I visualize each frame, only those 64 points have weights, and the visualized point cloud consists of only those 64 points as well. However, the visualizations in your paper are composed of many points. How do you handle this?

At present, my understanding of the visualization is to assign each frame's points their corresponding weights from attn. However, with only 64 points per frame the visualization looks almost empty. I hope you can point out what I am missing.

Hi, you need to upsample points via the feature propagation operation in PointNet++.

Thank you for your patient reply. Do you mean that I need to add a PointNet++ module before the transformer to upsample the points, so that the visualization looks like the ones in your paper? I have another question: to how many points do I need to upsample? Too many points will cause CUDA out-of-memory problems.

Nope.

First, when I made the visualization, I saved the input point clouds and the corresponding subsampled self-attention weights. Because the input is 2048 points, the visualization is of 2048 points.

Second, what the feature propagation operation does is interpolate the subsampled weights back to the original input points based on distance. Suppose a is an original point, and b and c are subsampled points with attention weights B and C, respectively. Then a's attention will be B/||b-a|| + C/||c-a||.
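If it helps, here is a small NumPy sketch of that interpolation; it normalizes the inverse-distance weights, as the PointNet++ feature propagation operation does, and k and eps are illustrative choices:

    import numpy as np

    def propagate_attention(P, Q, A, k=3, eps=1e-8):
        # P: (N, 3) original points, Q: (M, 3) subsampled points, A: (M,) attention weights
        diff = P[:, None, :] - Q[None, :, :]            # (N, M, 3) pairwise differences
        dist = np.sqrt((diff * diff).sum(-1)) + eps     # (N, M) distances
        idx = np.argsort(dist, axis=1)[:, :k]           # k nearest subsampled points per original point
        w = 1.0 / np.take_along_axis(dist, idx, 1)      # inverse-distance weights, (N, k)
        w = w / w.sum(1, keepdims=True)                 # normalize, as in feature propagation
        return (w * A[idx]).sum(1)                      # (N,) interpolated attention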

For the MSR model I do this:

def forward(self, input):                           # input: [B, L, N, 3]
    device = input.get_device()

    # save the raw input point clouds, flattened to [B*L*N, 3]
    # (assumes `import numpy as np` at the top of the file)
    input2 = input.cpu().detach()
    input2 = input2.reshape(-1, 3)
    print('input=', input2.shape)
    np.savetxt(r"/root/autodl-tmp/result1/result1.txt", input2)

    xyzs, features = self.tube_embedding(input)     # [B, L, n, 3], [B, L, C, n]

    # save the downsampled coordinates, flattened to [B*L*n, 3]
    input3 = xyzs.cpu().detach()
    input3 = input3.reshape(-1, 3)
    print('xyzs=', input3.shape)
    np.savetxt(r"/root/autodl-tmp/result1/result2.txt", input3)

For the transformer I do this:

dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
attn = dots.softmax(dim=-1)

# save the attention map of the first sample and the first head as [L*n, L*n]
attn1 = attn.cpu().detach()
attn2 = attn1[:1, :1, :, :].reshape(-1, attn1.shape[-1])
np.savetxt('/root/autodl-tmp/result1/attnresult.txt', attn2.numpy())

The first question is whether the way I save the data is correct. The second question: I assign the obtained attn weights to the xyzs point cloud, since I think the xyzs points are the coordinates of the points in attn and correspond to them one to one. But you said attn should be assigned to the 2048-point input cloud. However, the input has 2048 points while attn only covers 64 points per frame. Is that OK? Do you mean to directly merge the two? I'm very interested in your research on point cloud sequences, but I still have problems with the visualization. Thank you for your patience.

Hi,

It is not so complicated.

Suppose P with shape N x 3 is the input point cloud and Q with shape M x 3 are the downsampled points, where M < N. The downsampled points carry attention weights A with shape M x 1. Because there are multiple Transformer layers, you may select an intermediate layer; and because there are multiple heads, you also need to select one head.

Then, all you need to do is transfer the attention weights A to the input point cloud based on P and Q. Here is a very simple snippet:

    dist = np.expand_dims(P, 1) - np.expand_dims(Q, 0)   # (N, M, 3) pairwise differences
    dist = np.sum(dist*dist, -1)                          # (N, M) squared distances
    idx = np.argmin(dist, 1)                              # nearest downsampled point for each input point
    attn = A[idx]                                         # (N, 1) attention transferred to the input points

The attn is exactly what you want. You may also use the feature propagation operation to do this.
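Put together for a single frame, it might look like this (all shapes, indices and variable names here are illustrative and assume the arrays have already been saved as in the snippets above):

    import numpy as np

    # P: (N, 3) input points of one frame, Q: (M, 3) downsampled points of the matching frame,
    # attn: (l*n, l*n) softmaxed attention matrix for one sample and one head
    l, n = 12, 64                          # downsampled frames and points per frame (illustrative)
    q_idx = 0                              # one query position (illustrative)
    A = attn[q_idx].reshape(l, n)          # its weights over every downsampled (frame, point)

    t = 0                                  # downsampled frame to visualize (illustrative)
    dist = np.expand_dims(P, 1) - np.expand_dims(Q, 0)
    dist = np.sum(dist * dist, -1)
    idx = np.argmin(dist, 1)               # nearest downsampled point for every input point
    frame_attn = A[t][idx]                 # (N,) per-point weights, ready for points3d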

Thank you very much for the code you provided; I can now do an initial visualization of attn. There is one thing I still don't understand: the number of input frames is 24, but because of the temporal stride you use here, the number of frames of the sampled points is 12. How can I better match the input frames with the frames of the downsampled points? For example, the first input frame corresponds to the first frame of the downsampled points, but what about the second and third input frames?

Here, I save the input [14, 24, 1024, 3] as a file of shape [N, 3], where each batch element corresponds to 24 frames, and I save the sampled points [14, 12, 64, 3] as [M, 3], where each batch element corresponds to 12 frames. When I select a frame for visualization, how do I match each input frame with its downsampled points?
Here are some results of my visualization. What could be the reason for the gap between my results and yours?
[attached: three screenshots of the visualization results]
Thanks for your help.

Can you provide a detailed explanation of how the attention weight becomes [M, 1] in size?
I'm curious how to obtain a single weight per point from the softmax matrix, which initially has size [frame_length * tokens_per_frame, frame_length * tokens_per_frame].

Any new progress on how to visualize the attention?

I'm not sure whether this is how the authors obtained the weights, but I averaged over the dimension of the attention matrix where softmax was not applied to get the (M, 1) weights they mentioned, and the visualization was quite meaningful. However, I used a different dataset and task, adopting only the 4D conv structure.
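Concretely, something along these lines (the indices and shapes are illustrative):

    b, h = 0, 0                     # sample and head to visualize (illustrative)
    L, n = 12, 64                   # downsampled frames and points per frame (illustrative)
    w = attn[b, h].mean(dim=0)      # average over the query axis (softmax was over the key axis) -> (L*n,)
    w = w.reshape(L, n)             # row t is the (M,) weight vector for downsampled frame t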

thanks