lucidrains / nuwa-pytorch

Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch


Why does the video not pass through the encoder?

Wang-Xiaodong1899 opened this issue · comments

Hi, lucidrains! Thanks for providing a great repo that makes the NUWA paper easy to understand.
I have a question:
In the NUWA paper, the inputs to the Encoder are the caption tokens (the caption condition) and the video tokens (the 3DNA condition). So, in my view, the video token sequence should fully self-attend inside the Encoder, right? The Encoder's outputs then condition the Decoder.
The Decoder you provide is as follows:
[Screenshot of the Decoder code, taken 2022-05-12]
It has causal self-attention and text conditioning, as expected. But by the paper's definition, the condition includes both the text condition and the 3DNA condition, and both of these condition the Decoder. Is my understanding right? I am just curious about the conditioning in the NUWA paper.
The Encoder in your repo is only a Text-Encoder, so the video does not pass through an encoder to condition the Decoder.
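To make my question concrete, here is a minimal sketch (not the actual nuwa-pytorch API; `ConditionEncoder` and its arguments are hypothetical) of what I mean by the Encoder: full, non-causal self-attention over the concatenation of caption tokens and video tokens, whose output would then be the context the Decoder cross-attends to.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Hypothetical encoder that jointly self-attends over caption and
    video tokens, as I read the NUWA paper's Encoder description."""

    def __init__(self, dim, heads=8, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, caption_tokens, video_tokens):
        # concatenate both modalities along the sequence dimension so the
        # video tokens fully self-attend together with the caption tokens
        context = torch.cat([caption_tokens, video_tokens], dim=1)
        return self.encoder(context)  # this output would condition the Decoder

batch, dim = 2, 64
caption = torch.randn(batch, 10, dim)  # caption condition
video = torch.randn(batch, 32, dim)    # 3DNA video condition
context = ConditionEncoder(dim)(caption, video)
print(context.shape)  # torch.Size([2, 42, 64])
```

In the current repo, by contrast, only the caption tokens go through an encoder before the Decoder's cross-attention.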

Looking forward to your reply! Thanks!