RunpeiDong / DreamLLM

[ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation

Home Page: https://dreamllm.github.io/

What did the <dream> token learn exactly?

StupidDebugger opened this issue

Thanks a lot for open-sourcing the complete code of DreamLLM!

As I read your paper and parts of your code, one question came to mind that I'd like to ask you about: what exactly did the <dream> token learn? You present an explicit analysis of what the dream queries learn, but I couldn't find any statement about the <dream> token itself, and this question keeps intriguing me. The reason I focus on this token is that it tells the MLLM when to generate a new image, which suggests it understands, to some extent, what the MLLM has generated so far (in other words, it may carry semantic information about intent or something similar). I would like to hear how you think about this.
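To make sure I'm describing the mechanism correctly, here is a toy sketch of my understanding (the function names and the decoding loop below are my own placeholders, not your actual code):

```python
# Toy sketch of my understanding, not the actual DreamLLM code: the MLLM keeps
# emitting text tokens, and when it emits the special <dream> token it signals
# that an image should be generated at this point in the stream.

DREAM_TOKEN = "<dream>"

def llm_next_token(context):
    # Stand-in for one autoregressive decoding step of the MLLM.
    scripted_output = ["A", "photo", "of", "a", "cat", DREAM_TOKEN, "<eos>"]
    return scripted_output[len(context)]

def generate_image(context):
    # Stand-in for the dream-query / diffusion branch that actually draws the image.
    return f"[image conditioned on: {' '.join(context)}]"

context = []
while True:
    token = llm_next_token(context)
    if token == "<eos>":
        break
    if token == DREAM_TOKEN:
        # The <dream> token tells the model *when* to create a new image.
        print(generate_image(context))
    context.append(token)

# Prints: [image conditioned on: A photo of a cat]
```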

Thank you again for taking the time to read my issue. I hope to hear from you soon!

Hi @StupidDebugger,

Thank you for your interest in our work! This is a good and interesting question. I think I agree with you that it may contain semantic information about intent or something similar.

  • Technically, the dream queries extract conditioning semantics through the causal attention of the LLM, so they are trained to understand the semantics of the multimodal causal context. Multimodal causality forms the basis of the learning process: since the dream queries are used for image posterior modeling, modeling that causality largely falls on them. At the same time, because training is end-to-end, the LLM itself also learns to understand multimodal causality. (A toy sketch of this conditioning path follows this list.)
  • As presented in the paper, I call this dream query-based image generation diffusion distillation, because the multimodal semantics (distributions) encoded by the diffusion decoder are distilled through the dream queries during this process. Note that SD can also be unfrozen; sometimes this leads to better performance, especially when you want the model to acquire stronger image-to-image generation capabilities.
  • From another perspective, the dream queries are latent representations, right? So I believe they are dark, learnable representations that implicitly carry world knowledge, pretty much like a prototype of a latent world model.
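To make the conditioning path above concrete, here is a minimal PyTorch sketch of the idea rather than our actual implementation; the sizes, the single encoder layer, and the module choices are toy assumptions:

```python
import torch
import torch.nn as nn

hidden = 64        # toy LLM hidden size (illustrative only)
n_queries = 8      # number of learnable dream queries (illustrative only)
ctx_len = 16       # length of the preceding multimodal context

# Learnable dream queries, appended after the (already embedded) context tokens.
dream_queries = nn.Parameter(torch.randn(1, n_queries, hidden))
context = torch.randn(1, ctx_len, hidden)              # stand-in for LLM-embedded context
seq = torch.cat([context, dream_queries], dim=1)

# Causal self-attention: the dream queries can attend to the full context that
# precedes them, but the context cannot attend to the queries.
layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
encoded = layer(seq, src_mask=causal_mask)

# The LLM-encoded dream queries become the conditioning signal that is fed to
# the diffusion U-Net cross-attention (in place of CLIP text embeddings).
condition = encoded[:, -n_queries:, :]
print(condition.shape)                                 # torch.Size([1, 8, 64])
```

The point is that only the hidden states at the dream-query positions are handed to the diffusion decoder; the preceding context shapes the image only through what the queries have attended to.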

I hope this answer helps you, and I welcome any further discussions.

Hi @StupidDebugger,

I just realized that you are asking about the <dream> token rather than the dream queries. For the <dream> token: yes, I agree with your view.

Thanks for your timely response to my questions!
In fact, apart from my previous question, I have another one: what is the difference between these two methods? 1. Directly feed the generated context tokens of the MLLM into Stable Diffusion to produce a new image. 2. Use the dream queries as the input to the U-Net. This question came up this morning while I was discussing your paper with my colleagues, and we couldn't reach a conclusion. Could you give me a concise explanation? Thanks!

Hi @StupidDebugger,

  • If you directly input the context tokens, the model can certainly generate images. However, the performance and data efficiency are not as good as with dream queries. The dream queries are also what enable the multimodal learning synergy. (See the toy sketch after this list.)
  • What do you mean by using the dream queries as the input of the U-Net? Currently, we feed the LLM-encoded dream queries to the U-Net through cross-attention.
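To illustrate the contrast between the two routes, here is a toy, pure-PyTorch sketch (illustrative shapes only; this is neither our code nor the diffusers API):

```python
import torch
import torch.nn as nn

hidden = 64
ctx_states = torch.randn(1, 16, hidden)    # LLM hidden states of the generated context
dream_states = torch.randn(1, 8, hidden)   # LLM hidden states at the dream-query positions
unet_tokens = torch.randn(1, 32, hidden)   # stand-in for U-Net spatial feature tokens

cross_attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=4, batch_first=True)

# Route 1: condition the U-Net cross-attention directly on the raw context states.
out_ctx, _ = cross_attn(query=unet_tokens, key=ctx_states, value=ctx_states)

# Route 2 (what the paper does): condition on the LLM-encoded dream-query states,
# which are trained end-to-end to summarize the multimodal causal context.
out_dream, _ = cross_attn(query=unet_tokens, key=dream_states, value=dream_states)

print(out_ctx.shape, out_dream.shape)      # both torch.Size([1, 32, 64])
```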

Sorry for the confusion about how I phrased method 2. What I meant is the method you employ in your paper, and your response has cleared up my questions. I have no more questions now, so I'll close the issue. Thanks!