Vchitect / Latte

Latte: Latent Diffusion Transformer for Video Generation.

About video VAE

Darius-H opened this issue · comments

I found that the VAE merges the frame dimension into the batch dimension, which means there is no interaction between frames when encoding video latents. It works equivalently to an image VAE, which is not in line with Section 3.3.1 of the paper.

Latte/train.py

Line 207 in c456dff

x = rearrange(x, 'b f c h w -> (b f) c h w').contiguous()

Is this because subsequent experiments found that frame-to-frame interaction does not improve video generation?
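For reference, here is a minimal sketch of the pattern in the quoted line, assuming a standard diffusers `AutoencoderKL` image VAE (the checkpoint name, the `encode` call, and the scaling factor are illustrative assumptions, not copied from train.py):

```python
import torch
from diffusers.models import AutoencoderKL
from einops import rearrange

# Illustrative pretrained 2D (image) VAE; the exact checkpoint is an assumption.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

b, f, c, h, w = 2, 16, 3, 256, 256
x = torch.randn(b, f, c, h, w)  # dummy video batch

# Folding frames into the batch dimension: the 2D VAE now sees b*f
# independent images, so no information is exchanged between frames here.
x = rearrange(x, 'b f c h w -> (b f) c h w').contiguous()
with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample().mul_(0.18215)

# Restore the frame dimension on the latents before they go to the transformer.
latents = rearrange(latents, '(b f) c h w -> b f c h w', b=b)
print(latents.shape)  # torch.Size([2, 16, 4, 32, 32])
```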


Hi, thanks for your interest. What Section 3.3.1 refers to is not compressing the video along the temporal dimension at the VAE encoder stage. Instead, it refers to compression along the temporal dimension applied to the latents of the video frames.
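To make the distinction concrete, the sketch below shows one way temporal compression could be applied to the per-frame latents after the image VAE, via a 3D patch embedding with a temporal stride. The class name, patch size, and embedding dimension are hypothetical and do not reproduce the paper's exact Section 3.3.1 design.

```python
import torch
import torch.nn as nn

class CompressedPatchEmbed3D(nn.Module):
    """Hypothetical 3D patchify over VAE latents with temporal compression."""
    def __init__(self, in_channels=4, embed_dim=1152, patch=(2, 2, 2)):
        super().__init__()
        # A temporal stride of 2 merges pairs of frame latents into each token,
        # halving the number of frame positions seen by the transformer.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch, stride=patch)

    def forward(self, latents):
        # latents: (b, f, c, h, w) -> (b, c, f, h, w) for Conv3d
        x = latents.permute(0, 2, 1, 3, 4)
        x = self.proj(x)                      # (b, d, f/2, h/2, w/2)
        return x.flatten(2).transpose(1, 2)   # (b, num_tokens, d)

latents = torch.randn(2, 16, 4, 32, 32)        # per-frame VAE latents
tokens = CompressedPatchEmbed3D()(latents)     # -> (2, 8*16*16, 1152)
print(tokens.shape)                            # torch.Size([2, 2048, 1152])
```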