Question about diffusion model

lmdsx opened this issue a year ago

Thanks for your excellent work, but I have a question.

p = self.decoder(emb).squeeze(2)  
p += weight

Why did you add the weight to this distribution? p should already contain this information, so I don't understand the purpose of this step.

Peng Jin · Answer 1 · Wed Oct 25 2023 11:22:21 GMT+0800 (China Standard Time)

Thank you for raising the issue. This is an interesting phenomenon that we observed during our experiments. Specifically, we found that model training is much more stable after adding this residual connection.

You can try removing this residual connection, but doing so will reduce the performance of the model.
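
For readers following along, here is a minimal, framework-free sketch of what such a residual connection does. It is not the repository's code. It assumes that weight is the input score vector the decoder refines, and every name and value in it (decode_with_residual, the sample numbers) is hypothetical. The point it illustrates is that the decoder only has to produce a correction to weight: if the decoder outputs zeros, p is exactly weight.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_with_residual(decoder_out, weight):
    """Per-sample analogue of `p = self.decoder(emb).squeeze(2); p += weight`.

    `decoder_out` stands in for the decoder output (hypothetical values),
    `weight` for the input scores carried over by the skip connection.
    """
    return [d + w for d, w in zip(decoder_out, weight)]


# Hypothetical input scores over three candidates (assumption, not real data).
weight = [2.0, 0.5, -1.0]

# A decoder that outputs zeros leaves the input scores unchanged, so the
# network starts from the identity mapping rather than from random outputs.
p_init = decode_with_residual([0.0, 0.0, 0.0], weight)

# A non-zero decoder output only shifts the input scores.
p_refined = decode_with_residual([-0.5, 1.0, 0.0], weight)

print(p_init, softmax(p_init))
print(p_refined, softmax(p_refined))
```

Removing the residual term corresponds to returning decoder_out alone, in which case the decoder must reproduce the information in weight by itself.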

lmdsx · Answer 2 · Thu Oct 26 2023 09:53:37 GMT+0800 (China Standard Time)

Thank you for your answer. After reading your paper, I am very interested in the denoising structure. In addition to the previous question, I have two related ones:

1. Why is the time-position 't' placed on the second attention instead of the first (frame) attention?
2. Why is the subsequent video feature concatenated with the previous video representation? Is it an empirical setting, or was it verified experimentally? Is the concatenation needed because the second attention reduces the fine-grained frame information provided by the text?

If you can help answer these questions, I would be very grateful.

Peng Jin · Answer 3 · Thu Oct 26 2023 19:49:09 GMT+0800 (China Standard Time)

In fact, the structure of our denoising network is empirical.

In response to the first question, we have not tried the structure you mentioned. We think your idea is enlightening because placing the time position embedding in the first attention allows the model to focus on different frames at different time steps.

In response to the second question, our experience shows that concatenating all frames and text together when using contrastive loss does not work well. However, we have not tried to concatenate all the frames and text together in the diffusion models, so your idea is probably better.

lmdsx · Answer 4 · Fri Oct 27 2023 08:45:12 GMT+0800 (China Standard Time)

Thank you very much for your response. Your open-source work has been immensely helpful to me. I am looking forward to your future work.