prs-eth / Marigold

[CVPR 2024 - Oral, Best Paper Award Candidate] Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Home Page: https://marigoldmonodepth.github.io


Clarification Needed: Training and Inference Pipeline

ShenZheng2000 opened this issue · comments

I am new to diffusion models and am having difficulty understanding how Figures 2 (training) and 3 (inference) work.

Based on my understanding, here are the steps:

Inference:
(1) Take an image
(2) Extract features using latent encoder
(3) Add noise to image feature
(4) Use diffusion to remove noise
(5) Use latent decoder to obtain depth map

Training:
(1) Take an image and its depth map
(2) Extract features using latent encoder
(3) Add noise to depth feature
(4) Concat image features and noisy depth features
(5) Run diffusion to estimate the noise
(6) Compute the L2 difference between the actual and estimated noise.
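
The training steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Marigold's actual code: `TinyEncoder` and `TinyUNet` are hypothetical stand-ins for the VAE encoder and the denoising U-Net, and `alpha_bar` stands in for the noise scheduler's cumulative alpha at step `t`.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the VAE encoder and the denoising U-Net.
# Shapes and layer choices are illustrative assumptions only.
class TinyEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 4, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)  # 3-channel input -> 4-channel latent

class TinyUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Input: concatenated image latent + noisy depth latent (4 + 4 channels).
        # Output: predicted noise on the depth latent (4 channels).
        self.conv = torch.nn.Conv2d(8, 4, kernel_size=3, padding=1)

    def forward(self, z, t):
        return self.conv(z)  # timestep t is ignored in this toy model

encoder, unet = TinyEncoder(), TinyUNet()

image = torch.randn(1, 3, 32, 32)
depth = torch.randn(1, 1, 32, 32).repeat(1, 3, 1, 1)  # replicate depth to 3 channels

with torch.no_grad():
    z_img = encoder(image)    # clean image latent: the condition, never noised
    z_depth = encoder(depth)  # clean GT depth latent

t = torch.randint(0, 1000, (1,))
noise = torch.randn_like(z_depth)
alpha_bar = 0.9  # stand-in for the scheduler's cumulative alpha at step t
z_depth_noisy = alpha_bar**0.5 * z_depth + (1 - alpha_bar)**0.5 * noise

# The U-Net sees [image latent | noisy depth latent] and predicts the added noise.
pred_noise = unet(torch.cat([z_img, z_depth_noisy], dim=1), t)
loss = F.mse_loss(pred_noise, noise)  # L2 between actual and predicted noise
```

Note that noise is added only to the depth latent; the image latent enters the concatenation clean.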

Is my understanding correct?

I have a couple of questions:

Inference (Step 5): How does the latent decoder know how to produce the depth map? Did you train this decoder yourself, or did you use a pretrained model as-is? If you trained it, how did you manage that training separately? If you used a pretrained model, the role of the diffusion model in this decoding process is unclear to me.

Training (Step 5): How is the diffusion model able to extract noise from the concatenated image and noisy depth features? I understand that the L2 difference can provide strong guidance, but extracting noise from such a complex concatenation of features seems counter-intuitive to me.

I have similar confusion regarding the second question. How can the model 'separate' the noise from the concatenated image latent and the noise-added depth latent?

Hi, there are a few points that need to be clarified regarding your understanding:

  • The VAE encoder and decoder can handle depth maps if one replicates them to three channels (c.f. Section 3.2, Depth encoder and decoder).
  • The image latent is always clean, since it is the condition.
  • The U-Net output is a depth latent (of the same shape as the image latent).
  • During inference, the depth latent starts from Gaussian noise and is iteratively denoised.
  • During training, noise is only added to the GT depth latent. The U-Net is fine-tuned to generate depth map latents (which can be decoded to depth maps).
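
Putting the clarified inference procedure together, it could be sketched like this. All module names are hypothetical stand-ins (toy VAE encoder/decoder and U-Net), and the single-scale update line stands in for a real scheduler step (e.g. DDIM); it is a dataflow illustration, not the actual implementation.

```python
import torch

# Toy stand-ins for the VAE and the fine-tuned U-Net; purely illustrative.
class TinyEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 4, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)  # 3-channel image -> 4-channel latent

class TinyDecoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.deconv = torch.nn.ConvTranspose2d(4, 3, kernel_size=2, stride=2)

    def forward(self, z):
        return self.deconv(z)  # latent -> 3-channel output

class TinyUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(8, 4, kernel_size=3, padding=1)

    def forward(self, z, t):
        return self.conv(z)  # timestep t ignored in this toy model

encoder, decoder, unet = TinyEncoder(), TinyDecoder(), TinyUNet()

image = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    z_img = encoder(image)             # clean image latent, computed once
    z_depth = torch.randn_like(z_img)  # depth latent starts as pure Gaussian noise
    for t in reversed(range(10)):      # iterative denoising (toy step count)
        eps = unet(torch.cat([z_img, z_depth], dim=1), t)
        z_depth = z_depth - 0.1 * eps  # stand-in for a real scheduler update
    depth_3ch = decoder(z_depth)       # the VAE decoder yields a 3-channel map
    depth = depth_3ch.mean(dim=1, keepdim=True)  # average channels -> depth map
```

The key points from the bullets show up directly: only the depth latent is ever noisy, the image latent is a fixed clean condition, and the decoder is the same VAE decoder applied to the final depth latent.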

Thanks for your clarification. Can I interpret diffusion as a process of learning the joint distribution of a condition and a noised image? And can I use any combination of condition and input, such as an image and air dampness (XD), even if they have no relationship, thanks to the strong capacity of diffusion? Of course, in this work, SD is fine-tuned on pairs of images and depth maps, which have a strong relationship.

I would rather interpret the denoising process as an image-conditioned depth map generation process.

Yes, technically as long as you can feed it into this latent space (e.g. through the VAE encoder), you can condition on other things.