THUDM / CogVideo

Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)


About 3D Swin Attention

lemon-prog123 opened this issue · comments

In your description of the dual-channel attention, you add the attention-base and attention-plus patches at the end. But in the original 3D Swin Attention, videos are divided into 3D patches, which cannot be added to 2D patches. Did you just divide the frames into 2D patches and then apply the 3D Swin Attention method?

Hi, the different attention channels are computed independently and summed later in units of tokens, not patches. As mentioned in Sec. 3.2 of our paper, the temporal channel (attention-plus) has window size (A_x, A_y, T_s), so we adopt 3D swin attention; the spatial channel (attention-base) has window size (X, Y, 1), so we apply 2D attention within each frame, over all frames in parallel.
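A minimal NumPy sketch of this dual-channel idea, under simplifying assumptions (single head, no projections, and a temporal window (A_x, A_y, T_s) that covers the whole clip instead of a shifted local window): each channel attends over the same token sequence independently, and the two outputs are summed token-wise. All shapes and names here are illustrative, not the repository's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # plain scaled dot-product attention over the token axis
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, X, Y, D = 4, 2, 2, 8            # frames, token grid height/width, embed dim
tokens = rng.normal(size=(T, X * Y, D))

# Spatial channel (attention-base): window (X, Y, 1), i.e. full attention
# within each frame; the leading T axis batches the frames in parallel.
base_out = attend(tokens, tokens, tokens)            # (T, X*Y, D)

# Temporal channel (attention-plus): 3D swin-style window; simplified here
# so that the window (A_x, A_y, T_s) spans all tokens of all frames.
flat = tokens.reshape(1, T * X * Y, D)
plus_out = attend(flat, flat, flat).reshape(T, X * Y, D)

# The channels are combined in the unit of tokens: a per-token sum,
# so no 2D-vs-3D patch mismatch ever arises.
out = base_out + plus_out
print(out.shape)                                     # (4, 4, 8)
```

The point the answer makes is visible in the last line: both channels produce an output of the same token shape, so their results add element-wise regardless of how each channel windowed the tokens internally.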