THUDM / CogVideo

Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)


About 3D Swin Attention

lemon-prog123 opened this issue · comments

In your description of the dual-channel attention, you add the attention-base and attention-plus patches at the end. But in the original 3D Swin Attention, videos are divided into 3D patches, which cannot be added to 2D patches. Did you just divide the frames into 2D patches and then apply the 3D Swin Attention method?

Hi, the different attention channels are computed independently and summed later in units of tokens, not patches. As mentioned in Sec. 3.2 of our paper, the temporal channel (attention-plus) has window size (A_x, A_y, T_s), so we adopt 3D swin attention; the spatial channel (attention-base) has window size (X, Y, 1), so we apply 2D attention within each frame, over all frames in parallel.
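A minimal NumPy sketch of this dual-channel idea, under simplifying assumptions (single head, no projections, and a temporal window (A_x, A_y, T_s) that covers the whole clip instead of a shifted local window): each channel attends over the same token sequence independently, and the two outputs are summed token-wise. All shapes and names here are illustrative, not the repository's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # plain scaled dot-product attention over the token axis
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, X, Y, D = 4, 2, 2, 8            # frames, token grid height/width, embed dim
tokens = rng.normal(size=(T, X * Y, D))

# Spatial channel (attention-base): window (X, Y, 1), i.e. full attention
# within each frame; the leading T axis batches the frames in parallel.
base_out = attend(tokens, tokens, tokens)            # (T, X*Y, D)

# Temporal channel (attention-plus): 3D swin-style window; simplified here
# so that the window (A_x, A_y, T_s) spans all tokens of all frames.
flat = tokens.reshape(1, T * X * Y, D)
plus_out = attend(flat, flat, flat).reshape(T, X * Y, D)

# The channels are combined in the unit of tokens: a per-token sum,
# so no 2D-vs-3D patch mismatch ever arises.
out = base_out + plus_out
print(out.shape)                                     # (4, 4, 8)
```

The point the answer makes is visible in the last line: both channels produce an output of the same token shape, so their results add element-wise regardless of how each channel windowed the tokens internally.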