arpitbansal297 / Cold-Diffusion-Models

Official implementation of Cold Diffusion for different transformations, in PyTorch.

Home Page: https://arxiv.org/abs/2208.09392



doubts about the whole concept

iperov opened this issue · comments


Why does the network have to train the way you intended?

In every training step, the target is the maximum-quality, non-degraded sample.

[screenshot: firefox_2023-04-26_09-29-32]

If the network is good enough, it learns to produce the maximum-quality image in a single pass, so what are the other 49 steps for?

Or are you assuming that the network is bad enough that, instead of the target, it learns some improvement toward the target, e.g. a less blurred image from a blurred one? But then, applying 49 degradations to such an image, we will not get an improvement, because the image after the first pass is blurrier than what 49 degradations in the proposed sampling method would produce.
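For reference, this is the training step as I understand it from the paper (a minimal sketch; `model` and `degrade` are placeholder names, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def train_step(model, degrade, x0, T=50):
    """One step of the direct-restoration training loop: the regression
    target is always the clean sample x0, whatever the degradation level t.
    `degrade(x0, t)` applies t steps of the chosen degradation (e.g. blur)."""
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    x_t = degrade(x0, t)            # degraded input at a random severity
    x0_pred = model(x_t, t)         # UNet tries to restore x0 in one pass
    return F.mse_loss(x0_pred, x0)  # target: the maximum-quality sample
```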

At the moment I am experimenting with increasing the detail of images produced by an autoencoder, using Cold Diffusion.
My algorithm: take the output image from the frozen autoencoder, compute the difference between the target and the prediction, and degrade by blurring this difference before adding it back to the prediction (sketched in code below).
[screenshot: firefox_2023-04-26_09-36-19]
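In code, the degradation is roughly this (a sketch; the sigma schedule and the names are illustrative, not my exact setup):

```python
import torch
import torchvision.transforms.functional as TF

def degrade(pred, target, t, T=50):
    """Degradation at level t: blur the residual (target - pred) more and
    more strongly, then add it back to the frozen-autoencoder output `pred`.
    t=0 returns (almost) the full-detail target, t=T almost the plain pred."""
    diff = target - pred
    sigma = 0.1 + t * (8.0 / T)        # blur strength grows with t
    k = int(2 * round(3 * sigma) + 1)  # odd kernel size covering ~3 sigma
    return pred + TF.gaussian_blur(diff, kernel_size=k, sigma=sigma)
```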

Example degradations, animated:

[video: 2023-04-26_09-37-07.mp4]

Then I trained the model with 50 steps and got this. The background was not trained, only the face.

[video: 2023-04-26_09-40-28.mp4]

Details are increased.

But then I checked the output from the 1st and subsequent passes, and I got the same image every time!

[screenshot: 2023-04-26_09-23-19]

And, by the way, it has more detail if the sampling degradations are not used.

I would love to hear any comments or thoughts from you on this.

I agree.
I use this model for image restoration.
If the denoise_fn is trained well enough, there is no need for the rest of the reverse steps, and there is not much improvement from using the reverse process.

I have the same doubts. The sampling does not actually do much after training the model: taking 1 step gives almost exactly the same results as taking 1000 diffusion steps. This makes me think the model ends up just using the noising as a way to augment the inputs so that it generalizes better, but overall, if we can just use 1 step, I don't know if we can even consider this a 'diffusion model' anymore. It starts sounding more like a simple UNet model at that point. It would be great if the authors could clarify some of these things.
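For what it's worth, the check is essentially this (a minimal sketch; `model` and `sample_fn` are placeholders for my own wrappers, not the repo's API):

```python
import torch

@torch.no_grad()
def check(model, sample_fn, x_T, T=1000):
    """Compare a single forward pass against the full multi-step sampling.
    `model(x, t)` predicts the clean image; `sample_fn(model, x_T)` runs the
    reverse loop."""
    t = torch.full((x_T.shape[0],), T, device=x_T.device, dtype=torch.long)
    one_step = model(x_T, t)            # restore in one UNet pass
    multi_step = sample_fn(model, x_T)  # restore with the full reverse process
    print("max abs diff:", (one_step - multi_step).abs().max().item())
```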

Thanks @saifullah3396 for the information. It is great to know I am not doing something buggy; I had been checking it for a long time.

Hi @JiahaoHuang99 and @saifullah3396,
Yes, if the amount of degradation is small, i.e. we have enough signal to restore the image, then neither noise-based diffusion models nor any other restoration-based diffusion models require multiple steps of the UNet, provided it is good enough. However, as you mentioned, if the UNet is not good enough, one can use the sampling method proposed in the paper and improve on the restoration. We specifically discuss restoration in Section 4 of the paper for various degradations. In fact, in some cases, like MNIST, one generates a whole new number. Section 4 is not about generation, but this indicates that one can use these models for generation.

Generation using other degradations is discussed in Section 5 of the paper. In that section we degrade the images not just to a stage where enough signal is present, but to the stage where almost no information is left to reconstruct the whole image. This is exactly like the case of noise-based diffusion models, where at T=1000 we have no information about the original image. For example, in the case of blur we blur images to the extent that only a single value is present in each channel; in the case of animorphs we degrade them to the extent that the final image is just the animal; and so on. Hence, in this situation one step of the UNet is not good enough to reconstruct the images. I will refer to Figure 17, where we clearly show how generation using one step of the UNet looks different from generation via sampling. This is exactly analogous to noise-based diffusion models, where the first step produces a blurry image.
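For readers of the thread, the improved sampling the paper proposes (Algorithm 2) is, in sketch form (names are illustrative, not the repo's exact API):

```python
import torch

@torch.no_grad()
def cold_sample(model, degrade, x_T, T=1000):
    """Sketch of the improved sampling (Algorithm 2 in the paper).
    `model(x, t)` predicts the restored image x0_hat; `degrade(x0, t)`
    applies the degradation D for t steps, with degrade(x0, 0) == x0."""
    x_t = x_T
    for t in range(T, 0, -1):
        tt = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
        x0_hat = model(x_t, tt)
        # x_{t-1} = x_t - D(x0_hat, t) + D(x0_hat, t-1): re-degrade the current
        # estimate to levels t and t-1; when the UNet is imperfect this moves
        # toward the clean image in small, self-correcting steps.
        x_t = x_t - degrade(x0_hat, tt) + degrade(x0_hat, tt - 1)
    return x_t
```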

@saifullah3396 Can you please expand further on "This makes me think the model ends up just using the noising as a way to augment the inputs so that it generalizes better, but overall, if we can just use 1 step, I don't know if we can even consider this a 'diffusion model' anymore."
To make it clearer: the "diffusion model" essence comes from the fact that noise-based diffusion models do not generate an image in a single step and require multiple steps, which allows them to insert high-frequency signals. Similarly, we show in Section 5 the same phenomenon for other degradations: one can generate an image using multiple steps with our proposed sampling algorithm. Figure 17 further shows how multi-step sampling looks vastly different from one step of the UNet.

Hi @arpitbansal297, thank you very much for the detailed clarification! :)

I think when you mention the animal-to-human conversion, it makes a lot of sense. I understand that the process itself is exactly like basic noisy diffusion, which is why I liked this idea a lot, because often we aren't really dealing with Gaussian noise. But do you think that even if we can get the output in a single step, it can still be considered diffusion? (This is what I meant earlier; my comment only concerned the case where we can restore the image in a single step.)
For example, a single step simply means applying the UNet forward pass, so if that is possible in our scenario, do you think a simple image-to-image translation with a basic UNet would be able to capture the same distribution? Or do you think the forward diffusion during training still holds importance in that case, even if the model can restore the image in a single step?

I am just more confused as to why a single step works. Is it simply that the restoration is too easy? Or could there be some more complex reasoning behind it? Also, does it mean that if we apply Gaussian noise of only a small magnitude, it could also be removed in a single step in the same manner?

Hi @saifullah3396
In the case of restoration discussed in Section 4, where enough information is present, yes, a single step works well enough, and the only benefit of running the UNet multiple times is to improve the quality. And yes, in this case it is not exactly a diffusion model, as one is not generating an image but restoring it. And yes, it is the same for noise as well, i.e. for small noise one step of the UNet is good enough; we show this in the Appendix.
The simple reason one step works in Section 4 is that the degraded images retain enough information to be reconstructed in one step.

However, in Section 5, where we discuss cold generation, a single step is not good at all. And even if one does get a perfect image in one step, one can still use multiple iterations just to refine it (as shown in consistency models by OpenAI).
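A sketch of that kind of refinement, in the spirit of the comment above (a hypothetical loop; `model`, `degrade` and the step schedule are placeholder choices):

```python
import torch

@torch.no_grad()
def refine(model, degrade, x0_hat, steps=(400, 200, 100)):
    """Multi-pass refinement: re-degrade the current estimate to a
    decreasing severity level and restore it again, analogous to the
    multistep procedure in consistency models."""
    for t in steps:
        tt = torch.full((x0_hat.shape[0],), t,
                        device=x0_hat.device, dtype=torch.long)
        x0_hat = model(degrade(x0_hat, tt), tt)
    return x0_hat
```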