openai / consistency_models

Official repo for consistency models.

Inconsistent loss term with paper

quantumiracle opened this issue · comments

Hi,

Thanks for open-sourcing this wonderful project!

However, I notice that in CT training the loss term has the target model denoising $x_{t_{n+1}}$ instead of $x_{t_n}$, which differs from the loss stated in Alg. 3 (CT) of the paper, where the target model denoises $x_{t_n}$. Did I miss something, or does this change not matter?
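For reference, here is a minimal sketch of how I read the CT objective in Alg. 3 of the paper (the names `f_theta`, `f_theta_minus`, and `metric` are placeholders of mine, not the repo's API):

    import torch as th

    def ct_loss_alg3(f_theta, f_theta_minus, x, t_n, t_np1, metric):
        # f_theta: online consistency model, f_theta_minus: EMA target,
        # metric: the distance d(., .) -- all placeholder names
        z = th.randn_like(x)
        online = f_theta(x + t_np1 * z, t_np1)    # online model denoises x_{t_{n+1}}
        target = f_theta_minus(x + t_n * z, t_n)  # paper: target model denoises x_{t_n}
        return metric(online, target)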

There are some differences from the paper in general. For example, the rho scheduling is reversed relative to the paper; the formula used in the code here is closer to EDM (see the sketch below).
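To make the rho-schedule point concrete, here is a minimal sketch of the two conventions as I understand them (the defaults sigma_min=0.002, sigma_max=80, rho=7 are the usual EDM values, assumed rather than taken from this repo):

    import torch as th

    def edm_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
        # EDM-style grid: decreasing from sigma_max to sigma_min
        ramp = th.linspace(0, 1, n)
        return (sigma_max ** (1 / rho)
                + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

    def paper_timesteps(n, eps=0.002, T=80.0, rho=7.0):
        # paper-style grid: increasing from eps to T, i.e. the reversed indexing
        ramp = th.linspace(0, 1, n)
        return (eps ** (1 / rho) + ramp * (T ** (1 / rho) - eps ** (1 / rho))) ** rho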

There are other differences though; for example, the method for adding noise when computing the distiller_target is slightly different.

It's hard to know if the paper or the code is the better approach. I'm more inclined to think that the code is more up to date, but I'm basing that only on the release date of the repo being after the paper publication date.

Did you guys understand the preconditioning of the time signal in the denoising method of the Karras diffusion class? I also cannot find this equation in the paper:

  rescaled_t = 1000 * 0.25 * th.log(sigmas + 1e-44)
  model_output = model(c_in * x_t, rescaled_t, **model_kwargs)
  denoised = c_out * model_output + c_skip * x_t

from the denoising method: https://github.com/openai/consistency_models/blob/main/cm/karras_diffusion.py#L346

Without the 1000 it is the same noise conditioning used in the EDM preconditioning ($c_\text{noise} = \tfrac{1}{4}\ln\sigma$), but I don't understand the new factor of 1000.
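For comparison, a small sketch of the two noise-conditioning inputs side by side (the 1e-44 epsilon is copied from the snippet above, the rest is my own code):

    import torch as th

    sigmas = th.tensor([0.002, 1.0, 80.0])

    # EDM preconditioning: c_noise = ln(sigma) / 4
    c_noise_edm = 0.25 * th.log(sigmas)

    # this repo: the same quantity, rescaled by 1000 before it enters the model
    rescaled_t = 1000 * 0.25 * th.log(sigmas + 1e-44)

    print(c_noise_edm)  # roughly [-1.55, 0.00, 1.10]
    print(rescaled_t)   # roughly [-1553.7, 0.0, 1095.5]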

There are quite a few differences. I've raised an additional ticket and emailed Yang Song, hoping to find out which is the better approach.

#18

I would need to check, but the scaling might be due to how the temporal embedding is computed in the model. The EDM paper might be using a method that prefers small floats, whereas something like a sinusoidal positional embedding would prefer larger values.

Thanks for the info, that's good to hear! Let us know when you hear something.

They use an MLP to encode the timestep in the UNet, so large values should not be better. But maybe I am missing something there. Also, the general preconditioning of the noise is not mentioned in the paper at all.

I've not checked, but is the MLP applied to a concatenation of sin/cos values?

I think the c_in scaling does make sense. x_t is going to have a very large magnitude at large t, so scaling it should keep the variance at a nicer scale for the NN.
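A quick numeric sanity check of that intuition, using the EDM definition c_in = 1 / sqrt(sigma^2 + sigma_data^2) with sigma_data = 0.5 assumed (the EDM default, not read off this repo):

    import torch as th

    sigma_data = 0.5                     # assumed, as in EDM
    x0 = sigma_data * th.randn(100_000)  # toy "data" with std sigma_data

    for sigma in [0.01, 1.0, 80.0]:
        x_t = x0 + sigma * th.randn_like(x0)  # noisy sample at noise level sigma
        c_in = 1.0 / (sigma ** 2 + sigma_data ** 2) ** 0.5
        print(sigma, x_t.std().item(), (c_in * x_t).std().item())
        # x_t.std() grows with sigma, while (c_in * x_t).std() stays close to 1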

Good point, I missed one part: they are using a sinusoidal timestep embedding to encode the timestep before the MLP:

def timestep_embedding(timesteps, dim, max_period=10000):

So this could explain it.
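For context, a sketch of the usual guided-diffusion style sinusoidal embedding (written from memory, not copied from this repo):

    import math
    import torch as th

    def timestep_embedding(timesteps, dim, max_period=10000):
        # frequencies decay geometrically from 1 down to ~1/max_period
        half = dim // 2
        freqs = th.exp(
            -math.log(max_period) * th.arange(half, dtype=th.float32) / half
        ).to(timesteps.device)
        args = timesteps[:, None].float() * freqs[None]
        emb = th.cat([th.cos(args), th.sin(args)], dim=-1)
        if dim % 2:
            emb = th.cat([emb, th.zeros_like(emb[:, :1])], dim=-1)
        return emb

With frequencies spanning 1 down to 1/max_period, inputs on the order of hundreds or thousands (like DDPM's integer timesteps 0..999) activate many distinct phases, whereas 0.25 * log(sigma) alone only covers roughly [-1.6, 1.1] and would mostly excite the highest-frequency channels. Multiplying by 1000 puts the input back into the range this embedding was designed for, which would be consistent with the factor in the code.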