lucidrains / nuwa-pytorch

Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch


Why does the video not pass through the encoder?

Wang-Xiaodong1899 opened this issue · comments

Hi, lucidrains! Thanks for providing a great repo that makes the NUWA paper easy to understand.
I have a question:
In the NUWA paper, the inputs to the Encoder are the caption tokens (the caption condition) and the video tokens (the 3DNA condition). So, in my view, the video token sequence should fully self-attend inside the Encoder, right? The Encoder's outputs then condition the Decoder.
The Decoder you provide is as follows:
[Screenshot of the Decoder code, taken 2022-05-12]
It has causal self-attention and text conditioning, as expected. But by the paper's definition, the condition includes both the text condition and the 3DNA condition, and both of these condition the Decoder. Is my understanding right? I am just curious about the conditioning in the NUWA paper.
The Encoder in your repo is only a Text-Encoder, so the video does not pass through an encoder to condition the Decoder.
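To make my question concrete, here is a minimal sketch (not the actual nuwa-pytorch API; `ConditionEncoder` and its arguments are hypothetical) of what I mean by the Encoder: full, non-causal self-attention over the concatenation of caption tokens and video tokens, whose output would then be the context the Decoder cross-attends to.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Hypothetical encoder that jointly self-attends over caption and
    video tokens, as I read the NUWA paper's Encoder description."""

    def __init__(self, dim, heads=8, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, caption_tokens, video_tokens):
        # concatenate both modalities along the sequence dimension so the
        # video tokens fully self-attend together with the caption tokens
        context = torch.cat([caption_tokens, video_tokens], dim=1)
        return self.encoder(context)  # this output would condition the Decoder

batch, dim = 2, 64
caption = torch.randn(batch, 10, dim)  # caption condition
video = torch.randn(batch, 32, dim)    # 3DNA video condition
context = ConditionEncoder(dim)(caption, video)
print(context.shape)  # torch.Size([2, 42, 64])
```

In the current repo, by contrast, only the caption tokens go through an encoder before the Decoder's cross-attention.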

Looking forward to your reply! Thanks!