RunpeiDong / DreamLLM

[ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation

Home Page: https://dreamllm.github.io/


Question about the generation of conditional embeddings

PeihaoChen opened this issue · comments

Hi, thanks for your great work. I have some questions from reading the paper.

The generation of the conditional embeddings is shown in Equation (3): the learnable dream tokens, together with the interleaved document sequence seen so far x and the images generated so far V, are fed into a cross-attention model to produce the conditional embeddings. I have several questions about this process:

  1. What is the architecture of the cross-attention model? How deep is it? Is it randomly initialized?
  2. For the input to this cross-attention model, do you feed in the raw text features before the LLM, or the last hidden states of the LLM? And what about the visual features V: do you use the features output by the visual encoder followed by the visual projection?

Thanks!

Hi @PeihaoChen,

  • The cross-attention is not a new model; it is the original Stable Diffusion cross-attention inside the UNet (see the sketch after this list).
  • This cross-attention receives only the dream queries and the image latents, just like the original diffusion UNet.
  • The visual features are projected after the visual encoding.
  • The code is released; more details can be found there.
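
To make the second point concrete, here is a minimal, self-contained sketch (not the DreamLLM code) of a Stable Diffusion-style UNet cross-attention layer: the flattened image latents act as queries, and the dream-query outputs, projected to the conditioning width, act as keys and values in place of the CLIP text embeddings. All dimensions and the `to_cond` projector below are illustrative assumptions.

```python
# Illustrative sketch only (NOT the DreamLLM implementation): a UNet-style
# cross-attention that attends from image latents to conditional embeddings
# derived from the dream queries.
import torch
import torch.nn as nn

class UNetCrossAttention(nn.Module):
    def __init__(self, latent_dim=320, cond_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=heads,
            kdim=cond_dim, vdim=cond_dim, batch_first=True,
        )

    def forward(self, latents, cond):
        # latents: (B, H*W, latent_dim) flattened UNet feature map
        # cond:    (B, num_dream_queries, cond_dim) conditional embeddings
        out, _ = self.attn(query=latents, key=cond, value=cond)
        return out

# Hypothetical dream-query outputs from the LLM, projected to the UNet's
# conditioning width (768 for Stable Diffusion v1.x).
dream_query_out = torch.randn(1, 64, 4096)   # assumed LLM hidden size / query count
to_cond = nn.Linear(4096, 768)               # assumed learned condition projector
cond = to_cond(dream_query_out)              # (1, 64, 768)

latents = torch.randn(1, 32 * 32, 320)       # flattened UNet feature map
attn = UNetCrossAttention()
print(attn(latents, cond).shape)             # torch.Size([1, 1024, 320])
```

The key point is that these cross-attention weights come from the pretrained Stable Diffusion UNet rather than a randomly initialized module; what changes is the conditioning input, which is the dream-query output instead of CLIP text embeddings.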

We have released a new toolkit, `Omni`, which can be used to develop multimodal LLMs.