hehefan / P4Transformer

Implementation of the "Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos" paper.

visualize transformer's attention

weiyutao886 opened this issue · comments

I want to visualize the transformer's attention. I see that Fig. 4 in your paper visualizes it. Can you tell me where and how to do this? Can you share the visualization code? Thank you.

Hi,

The visualization in the paper is generated by Mayavi. You can use it to visualize the self-attention map attn at https://github.com/hehefan/P4Transformer/blob/main/modules-pytorch-1.8.1/transformer.py#L60.

Best regards.

Thank you for your reply. Do you save the attn data during training and then visualize it, or do you visualize it directly? Also, I found https://docs.enthought.com/mayavi/mayavi/auto/mlab_helper_functions.html#points3d — is this the part of Mayavi you used? Since the data is a sequence, do I need to process attn to separate out a single frame before visualization? I don't know much about visualization, so I'm sorry to bother you again.

Hi,

Apologies for my late reply.

I saved the attn data during the evaluation. I used the points3d function of Mayavi to visualize each frame. Also, note that you need to save the position of each query area.
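For each frame, a call along these lines works (a minimal sketch; the file names and scale settings are illustrative, not the exact script behind the paper figures):

    import numpy as np
    from mayavi import mlab

    xyz = np.loadtxt('frame_points.txt')      # (N, 3) point coordinates of one frame (illustrative file)
    w = np.loadtxt('frame_attention.txt')     # (N,) attention weight per point (illustrative file)

    mlab.figure(bgcolor=(1, 1, 1))
    mlab.points3d(xyz[:, 0], xyz[:, 1], xyz[:, 2], w,
                  scale_mode='none', scale_factor=0.02, colormap='jet')
    mlab.show()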

Best.

I saw that the dimension of attn is [batch_size, heads, c*l, c*l], where batch_size is the batch, heads is the number of transformer heads, and c*l is the product of the number of frames and the number of points, as you also mention in the paper. But attn contains no point positions. How can I visualize the shape of the person in your paper from attn alone? In other words, visualizing attn only visualizes its weights, and without point positions it cannot be placed in space. I'm sorry to trouble you again.

Hi,

Point features and attention weights are associated with point positions/coordinates in point-based methods. You can associate them manually by modifying the code.

In [batch_size, heads, c1, c2], c1 indexes the queries and c2 indexes the attention weights over the keys.
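For example (the indices here are illustrative):

    b, h, q_idx = 0, 0, 0          # sample, head and query position to inspect (illustrative)
    weights = attn[b, h, q_idx]    # how this query attends to every (frame, point) position, length L*n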

Best.

So you mean that c1 indexes the query points and c2 gives the corresponding weights for each query, so I can take out the c2 weights and assign them to the point cloud coordinates to visualize them, right? I do this at present, but there are only 128 points in the point cloud of each frame, so the visualization does not look good. Do you have any suggestions?

I read your code. The data you input into the transformer is [14, 12, 1024, 64], that is, 12 frames with 64 points in each frame, so attn covers 12 * 64 points. In this way there are only 64 points per frame: if I visualize each frame, only those 64 points have weights, and the visualized point cloud consists of only those 64 points as well. However, the visualizations in your paper are composed of many points. How do you handle this?

At present, my understanding of the visualization is to assign each frame's points their corresponding weights from attn. However, with only 64 points per frame the visualization looks almost empty. I hope you can point out what I am missing.

Hi, you need to upsample points via the feature propagation operation in PointNet++.

Thank you for your patient reply. Do you mean that I need to add a PointNet++ module before the transformer to upsample the points, so that the visualization looks like the ones in your paper? I have another question: to how many points do I need to upsample? Too many points will cause CUDA out-of-memory problems.

Nope.

First, when I made the visualization, I saved the input point clouds and the corresponding subsampled self-attention weights. Because the input is 2048 points, the visualization is of 2048 points.

Second, what the feature propagation operation does is interpolate the subsampled weights back to the original input points based on distance. Suppose a is an original point, and b and c are subsampled points with attention weights B and C, respectively. Then a's attention will be B/||b-a|| + C/||c-a||.
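If it helps, here is a small NumPy sketch of that interpolation; it normalizes the inverse-distance weights, as the PointNet++ feature propagation operation does, and k and eps are illustrative choices:

    import numpy as np

    def propagate_attention(P, Q, A, k=3, eps=1e-8):
        # P: (N, 3) original points, Q: (M, 3) subsampled points, A: (M,) attention weights
        diff = P[:, None, :] - Q[None, :, :]            # (N, M, 3) pairwise differences
        dist = np.sqrt((diff * diff).sum(-1)) + eps     # (N, M) distances
        idx = np.argsort(dist, axis=1)[:, :k]           # k nearest subsampled points per original point
        w = 1.0 / np.take_along_axis(dist, idx, 1)      # inverse-distance weights, (N, k)
        w = w / w.sum(1, keepdims=True)                 # normalize, as in feature propagation
        return (w * A[idx]).sum(1)                      # (N,) interpolated attention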

For the MSR model I do this:

def forward(self, input):                           # input: [B, L, N, 3]
    device = input.get_device()

    # save the raw input point clouds, flattened to [B*L*N, 3]
    # (assumes `import numpy as np` at the top of the file)
    input2 = input.cpu().detach()
    input2 = input2.reshape(-1, 3)
    print('input=', input2.shape)
    np.savetxt(r"/root/autodl-tmp/result1/result1.txt", input2)

    xyzs, features = self.tube_embedding(input)     # [B, L, n, 3], [B, L, C, n]

    # save the downsampled coordinates, flattened to [B*L*n, 3]
    input3 = xyzs.cpu().detach()
    input3 = input3.reshape(-1, 3)
    print('xyzs=', input3.shape)
    np.savetxt(r"/root/autodl-tmp/result1/result2.txt", input3)

For the transformer I do this:

dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
attn = dots.softmax(dim=-1)

# save the attention map of the first sample and the first head as [L*n, L*n]
attn1 = attn.cpu().detach()
attn2 = attn1[:1, :1, :, :].reshape(-1, attn1.shape[-1])
np.savetxt('/root/autodl-tmp/result1/attnresult.txt', attn2.numpy())

The first question is whether the way I save the data is correct. The second question: I assign the obtained attn weights to the xyzs point cloud, since I think the xyzs points are the coordinates of the points in attn and correspond to them one to one. But you said attn should be assigned to the 2048-point input cloud. However, the input has 2048 points while attn only covers 64 points per frame. Is that OK? Do you mean to directly merge the two? I'm very interested in your research on point cloud sequences, but I still have problems with the visualization. Thank you for your patience.

Hi,

It is not so complicated.

Suppose P with shape N x 3 is the input point cloud and Q with shape M x 3 are the downsampled points, where M < N. The downsampled points carry attention weights A with shape M x 1. Because there are multiple Transformer layers, you may select an intermediate layer; and because there are multiple heads, you also need to select one head.

Then, all you need to do is transfer the attention weights A to the input point cloud based on P and Q. Here is a very simple snippet:

    dist = np.expand_dims(P, 1) - np.expand_dims(Q, 0)   # (N, M, 3) pairwise differences
    dist = np.sum(dist*dist, -1)                          # (N, M) squared distances
    idx = np.argmin(dist, 1)                              # nearest downsampled point for each input point
    attn = A[idx]                                         # (N, 1) attention transferred to the input points

The attn is exactly what you want. You may also use the feature propagation operation to do this.
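Put together for a single frame, it might look like this (all shapes, indices and variable names here are illustrative and assume the arrays have already been saved as in the snippets above):

    import numpy as np

    # P: (N, 3) input points of one frame, Q: (M, 3) downsampled points of the matching frame,
    # attn: (l*n, l*n) softmaxed attention matrix for one sample and one head
    l, n = 12, 64                          # downsampled frames and points per frame (illustrative)
    q_idx = 0                              # one query position (illustrative)
    A = attn[q_idx].reshape(l, n)          # its weights over every downsampled (frame, point)

    t = 0                                  # downsampled frame to visualize (illustrative)
    dist = np.expand_dims(P, 1) - np.expand_dims(Q, 0)
    dist = np.sum(dist * dist, -1)
    idx = np.argmin(dist, 1)               # nearest downsampled point for every input point
    frame_attn = A[t][idx]                 # (N,) per-point weights, ready for points3d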

Thank you very much for the code you provided; I can now do an initial visualization of attn. There is one thing I still don't understand: the number of input frames is 24, but because of the temporal stride you use here, the number of frames of the sampled points is 12. How can I better match the input frames with the frames of the downsampled points? For example, the first input frame corresponds to the first frame of the downsampled points, but what about the second and third input frames?

Here, I save the input [14, 24, 1024, 3] as a file of shape [N, 3], where each batch element corresponds to 24 frames, and I save the sampled points [14, 12, 64, 3] as [M, 3], where each batch element corresponds to 12 frames. When I select a frame for visualization, how do I match each input frame with its downsampled points?
Here are some results of my visualization. What could be the reason for the gap between my results and yours?
[attached: three screenshots of the visualization results]
Thanks for your help.

Can you provide a detailed explanation of how the attention weight becomes [M, 1] in size?
I'm curious how to obtain a single weight per point from the softmax matrix, which initially has size [frame_length * tokens_per_frame, frame_length * tokens_per_frame].

Any new progress on how to visualize the attention?

I'm not sure whether this is how the authors obtained the weights, but I averaged over the dimension of the attention matrix where softmax was not applied to get the (M, 1) weights they mentioned, and the visualization was quite meaningful. However, I used a different dataset and task, adopting only the 4D conv structure.
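Concretely, something along these lines (the indices and shapes are illustrative):

    b, h = 0, 0                     # sample and head to visualize (illustrative)
    L, n = 12, 64                   # downsampled frames and points per frame (illustrative)
    w = attn[b, h].mean(dim=0)      # average over the query axis (softmax was over the key axis) -> (L*n,)
    w = w.reshape(L, n)             # row t is the (M,) weight vector for downsampled frame t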

thanks