prs-eth / Marigold

[CVPR 2024 - Oral, Best Paper Award Candidate] Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Home Page: https://marigoldmonodepth.github.io


Clarification Needed: Training and Inference Pipeline

ShenZheng2000 opened this issue · comments

I am new to diffusion models and am having difficulty understanding how Figures 2 (training) and 3 (inference) work.

Based on my understanding, here are the steps:

Inference:
(1) Take an image
(2) Extract features using latent encoder
(3) Add noise to image feature
(4) Use diffusion to remove noise
(5) Use latent decoder to obtain depth map

Training:
(1) Take an image and its depth map
(2) Extract features using latent encoder
(3) Add noise to depth feature
(4) Concat image features and noisy depth features
(5) Run diffusion to estimate the noise
(6) Compute the L2 difference between the actual and estimated noise.
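
The training steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Marigold's actual code: `TinyEncoder` and `TinyUNet` are hypothetical stand-ins for the VAE encoder and the denoising U-Net, and `alpha_bar` stands in for the noise scheduler's cumulative alpha at step `t`.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the VAE encoder and the denoising U-Net.
# Shapes and layer choices are illustrative assumptions only.
class TinyEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 4, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)  # 3-channel input -> 4-channel latent

class TinyUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Input: concatenated image latent + noisy depth latent (4 + 4 channels).
        # Output: predicted noise on the depth latent (4 channels).
        self.conv = torch.nn.Conv2d(8, 4, kernel_size=3, padding=1)

    def forward(self, z, t):
        return self.conv(z)  # timestep t is ignored in this toy model

encoder, unet = TinyEncoder(), TinyUNet()

image = torch.randn(1, 3, 32, 32)
depth = torch.randn(1, 1, 32, 32).repeat(1, 3, 1, 1)  # replicate depth to 3 channels

with torch.no_grad():
    z_img = encoder(image)    # clean image latent: the condition, never noised
    z_depth = encoder(depth)  # clean GT depth latent

t = torch.randint(0, 1000, (1,))
noise = torch.randn_like(z_depth)
alpha_bar = 0.9  # stand-in for the scheduler's cumulative alpha at step t
z_depth_noisy = alpha_bar**0.5 * z_depth + (1 - alpha_bar)**0.5 * noise

# The U-Net sees [image latent | noisy depth latent] and predicts the added noise.
pred_noise = unet(torch.cat([z_img, z_depth_noisy], dim=1), t)
loss = F.mse_loss(pred_noise, noise)  # L2 between actual and predicted noise
```

Note that noise is added only to the depth latent; the image latent enters the concatenation clean.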

Is my understanding correct?

I have a couple of questions:

Inference (Step 5): How does the latent decoder know how to produce the depth map? Did you train this decoder yourself, or did you use a pretrained model as-is? If you trained it, how did you manage that training separately? If you used a pretrained model, the role of the diffusion model in this decoding process is unclear to me.

Training (Step 5): How is the diffusion model able to extract noise from the concatenated image and noisy depth features? I understand that the L2 difference can provide strong guidance, but extracting noise from such a complex concatenation of features seems counter-intuitive to me.

I have similar confusion regarding the second question. How can the model 'separate' the noise from the concatenated image latent and the noise-added depth latent?

Hi, there are a few points that need to be clarified regarding your understanding:

  • The VAE encoder and decoder can handle depth maps if one replicates them to three channels (c.f. Section 3.2, Depth encoder and decoder).
  • The image latent is always clean, since it is the condition.
  • The U-Net output is a depth latent (of the same shape as the image latent).
  • During inference, the depth latent starts from Gaussian noise and is iteratively denoised.
  • During training, noise is only added to the GT depth latent. The U-Net is fine-tuned to generate depth map latents (which can be decoded to depth maps).
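
Putting the clarified inference procedure together, it could be sketched like this. All module names are hypothetical stand-ins (toy VAE encoder/decoder and U-Net), and the single-scale update line stands in for a real scheduler step (e.g. DDIM); it is a dataflow illustration, not the actual implementation.

```python
import torch

# Toy stand-ins for the VAE and the fine-tuned U-Net; purely illustrative.
class TinyEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 4, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)  # 3-channel image -> 4-channel latent

class TinyDecoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.deconv = torch.nn.ConvTranspose2d(4, 3, kernel_size=2, stride=2)

    def forward(self, z):
        return self.deconv(z)  # latent -> 3-channel output

class TinyUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(8, 4, kernel_size=3, padding=1)

    def forward(self, z, t):
        return self.conv(z)  # timestep t ignored in this toy model

encoder, decoder, unet = TinyEncoder(), TinyDecoder(), TinyUNet()

image = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    z_img = encoder(image)             # clean image latent, computed once
    z_depth = torch.randn_like(z_img)  # depth latent starts as pure Gaussian noise
    for t in reversed(range(10)):      # iterative denoising (toy step count)
        eps = unet(torch.cat([z_img, z_depth], dim=1), t)
        z_depth = z_depth - 0.1 * eps  # stand-in for a real scheduler update
    depth_3ch = decoder(z_depth)       # the VAE decoder yields a 3-channel map
    depth = depth_3ch.mean(dim=1, keepdim=True)  # average channels -> depth map
```

The key points from the bullets show up directly: only the depth latent is ever noisy, the image latent is a fixed clean condition, and the decoder is the same VAE decoder applied to the final depth latent.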

Thanks for your clarification. Can I interpret diffusion as a process of learning the joint distribution of a condition and a noised image? And can I use any combination of condition and input, such as an image and air dampness (XD), even if they have no relationship, thanks to the strong capacity of diffusion? Of course, in this work, SD is fine-tuned on pairs of images and depth maps, which have a strong relationship.

I would rather interpret the denoising process as an image-conditioned depth map generation process.

Yes, technically as long as you can feed it into this latent space (e.g. through the VAE encoder), you can condition on other things.