RunpeiDong / DreamLLM

[ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation

Home Page: https://dreamllm.github.io/

Clarification Regarding Visual Embeddings

Shaurya026 opened this issue

Hi, thanks for this new technique for extending MLLMs to interleaved documents.

I had a doubt regarding the visual encoder in Figure 2 and Section 3.

[Screenshot of Figure 2 from the paper]

In Figure 2, as I understand it, the model learns to generate <dream> tokens already during training (as in the cat example), for which dream queries are learned; these are essentially textual-inversion-style embeddings that can be synergized with the remaining context of the textual tokens. (The figure is labeled as the inference stream, but the interleaved document shows text with the cat image, which confuses me a bit.)
But when that happens, are we again sending the input image to the model via CLIP encodings through an extra projection, as shown in the diagram? Or are we just using the dream queries further ahead?

Also, for visual inputs, are we following a pipeline similar to Emu's, i.e., CLIP embeddings followed by a projection?

Thank you

Hi @Shaurya026,

Thanks for your question.

  • Yes. All generated images are sent back to the model for subsequent token prediction. This is important because the real interleaved distribution is modeled only when the generated image is fed back.
  • The visual input encoding is similar to Emu's. As far as I know, this practice was first introduced by Kosmos-1, where only a linear projection layer is used as the visual connector, and it has been followed by many works, including Emu. The visual encoder itself may differ, e.g., CLIP-G or CLIP-L. By the way, some other works use more complicated architectures, such as a Q-Former or Perceiver, as the visual connector (projection); see the sketch after this list.
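
For concreteness, here is a minimal sketch (not the DreamLLM source code) of this kind of linear visual connector: CLIP patch features are projected into the LLM embedding space with a single linear layer, and the resulting image embeddings, whether from an input image or from a generated image that is fed back, are spliced into the interleaved sequence for subsequent token prediction. All names and dimensions (`LinearVisualConnector`, `clip_dim`, `llm_dim`, the patch count) are illustrative assumptions.

```python
# Minimal sketch of a Kosmos-1/Emu-style linear visual connector.
# Dimensions and names are illustrative, not taken from the DreamLLM code.
import torch
import torch.nn as nn


class LinearVisualConnector(nn.Module):
    """Single linear projection from the visual encoder space to the LLM space."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, clip_dim) from e.g. CLIP-L or CLIP-G
        return self.proj(patch_feats)  # (batch, num_patches, llm_dim)


if __name__ == "__main__":
    connector = LinearVisualConnector(clip_dim=1024, llm_dim=4096)

    # Stand-in CLIP features for one image (an input image *or* an image the
    # model just generated and is feeding back for further token prediction).
    image_feats = torch.randn(1, 256, 1024)
    image_embeds = connector(image_feats)  # (1, 256, 4096)

    # Text embeddings before and after the image in the interleaved document.
    text_before = torch.randn(1, 12, 4096)
    text_after = torch.randn(1, 20, 4096)

    # The LLM attends over [text_before | image | text_after] as one sequence.
    interleaved = torch.cat([text_before, image_embeds, text_after], dim=1)
    print(interleaved.shape)  # torch.Size([1, 288, 4096])
```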

Thanks, that helps! 👍🏻