152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)

Implement samplers correctly

152334H opened this issue

  • rudimentary DPM++2M implementation
  • explore other DPM-Solver samplers
  • figure out if k-diffusion is still possible
  • UniPC

The UniPC project says it supports $\epsilon_\theta(x_t, t)$ models, so I'll give it a go.

Their code in https://github.com/wl-zhao/UniPC/blob/main/example/stable-diffusion/ldm/models/diffusion/uni_pc/uni_pc.py also looks very similar to the DPM-Solver repo, which I'll be integrating soon, so that's good.
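To make the wrapping concrete: both repos expect a plain $\epsilon_\theta(x_t, t)$ callable, exposed through the `NoiseScheduleVP` + `model_wrapper` interface that `dpm_solver_pytorch.py` and `uni_pc.py` share. Below is a minimal sketch under my assumptions about the TorToiSe side: `diffusion_net`, `betas`, `model_kwargs` and the helper names are illustrative, not code from either repo. The key detail is that TorToiSe's network predicts $\epsilon$ and a learned variance together (I'm assuming they're stacked on the channel dimension), so the $\epsilon$ half has to be split out first.

```python
import torch

# Both repos expose the same wrapping interface (dpm_solver_pytorch.py in
# DPM-Solver, uni_pc.py in UniPC): a NoiseScheduleVP plus a model_wrapper.
from dpm_solver_pytorch import NoiseScheduleVP, model_wrapper


def make_eps_fn(diffusion_net):
    """Adapter (hypothetical): TorToiSe's net predicts epsilon and a learned
    variance stacked on the channel dim; the solvers only want the epsilon half."""
    def eps_only(x, t, **kwargs):
        out = diffusion_net(x, t, **kwargs)
        eps, _learned_var = torch.chunk(out, 2, dim=1)  # discard the Sigma prediction
        return eps
    return eps_only


def wrap_for_solver(diffusion_net, betas, model_kwargs):
    # betas: the discrete beta schedule the TorToiSe diffusion model was trained with
    noise_schedule = NoiseScheduleVP(schedule="discrete", betas=betas)
    model_fn = model_wrapper(
        make_eps_fn(diffusion_net),
        noise_schedule,
        model_type="noise",          # epsilon prediction, not x0 or v
        model_kwargs=model_kwargs,   # e.g. precomputed conditioning embeddings
    )
    return model_fn, noise_schedule
```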

On a related note, I realised a few days ago (thanks to mrq) that my implementation of k-diffusion was actually completely wrong.

I'll be adding code that actually runs DPM++2M correctly in about an hour (the k-diffusion integration is most likely screwed), and then I can go for UniPC.

I'll write a larger blog about this later, but to clarify, this is what happened:

  • I added functions for k-diffusion, but never actually called them in my code
  • All "DPM++2M" results before today's commit were actually done with p_sample, which is basically random sampling of the gaussian distribution with the mean and variance values calculated by the model at each step. Yes, this means that the "really good 10 step results" were actually just plain DDIM.
  • After realising this, I tried to fix it. As it turns out, integrating the k-diffusion library into TorToiSe's diffusion model is a non-trivial task, because k-diffusion defines its own sigmas, which conflict with the beta schedule used to calculate $x_\theta(x_t, t)$ (which k-diffusion expects) in p_mean_variance. I tried to calculate the $x_0$ prediction from the raw model output plus Karras' sigmas, but I got a bunch of noise. It's entirely possible I just failed to write the right integration code for the k-diffusion samplers (one standard wrapping route is sketched after this list), but I'm not going to keep working on it for now.
  • I have instead opted to use the $\epsilon_\theta(x_t, t)$ output from the diffusion model directly, discarding the $\Sigma_\theta(x_t, t)$ prediction, and to feed it to the DPM-Solver repo to run DPM++2M (a sketch of this path also follows after this list). This "worked", but it doesn't produce good results for steps <= 20; I assume that's either because of the discarded variance or because I wrote a wrong constant somewhere.
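For the second bullet, the p_sample update is nothing more than a draw from the per-step Gaussian the model parameterises:

$$x_{t-1} = \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)^{1/2}\, z, \qquad z \sim \mathcal{N}(0, I)$$

i.e. there is no higher-order solver machinery involved at all.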
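For the third bullet, the standard way to bridge a discrete-beta $\epsilon_\theta$ model into k-diffusion is its `DiscreteEpsDDPMDenoiser` wrapper, which owns the beta-to-sigma conversion ($\sigma_t = \sqrt{(1-\bar\alpha_t)/\bar\alpha_t}$) and returns the denoised $x_0$-style prediction the samplers expect. The sketch below shows roughly what that route would look like, assuming an eps-only callable like the one returned by `make_eps_fn` above; it is not the code in this repo, and I haven't verified that it produces clean audio:

```python
import torch
import k_diffusion as K


def sample_with_k_diffusion(eps_only, betas, shape, steps=20, model_kwargs=None, device="cuda"):
    # eps_only(x, t, **kwargs) -> eps_theta only (hypothetical adapter from the earlier sketch)
    alphas_cumprod = torch.cumprod(1.0 - betas.to(device), dim=0)
    # DiscreteEpsDDPMDenoiser converts between the discrete beta schedule and
    # k-diffusion's sigmas, and returns the x0-style denoised prediction.
    denoiser = K.external.DiscreteEpsDDPMDenoiser(eps_only, alphas_cumprod, quantize=False)
    sigmas = denoiser.get_sigmas(steps)                # descending schedule, ends at 0
    x = torch.randn(shape, device=device) * sigmas[0]  # k-diffusion expects x_T scaled by sigma_max
    return K.sampling.sample_dpmpp_2m(denoiser, x, sigmas, extra_args=model_kwargs or {})
```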
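Finally, for the last bullet, the experimental-but-real path follows the DPM-Solver README recipe: `algorithm_type="dpmsolver++"` with `order=2` and `method="multistep"` is what that repo calls DPM-Solver++(2M). Again a sketch (constructor arguments vary a bit between DPM-Solver versions), reusing the hypothetical `wrap_for_solver` from the earlier snippet:

```python
import torch
from dpm_solver_pytorch import DPM_Solver


def run_dpmpp_2m(model_fn, noise_schedule, shape, steps=30, device="cuda"):
    # model_fn / noise_schedule as returned by the hypothetical wrap_for_solver(...) above
    x_T = torch.randn(shape, device=device)  # start from pure Gaussian noise
    solver = DPM_Solver(model_fn, noise_schedule, algorithm_type="dpmsolver++")
    return solver.sample(
        x_T,
        steps=steps,              # per the notes above, steps <= 20 currently sounds worse
        order=2,
        skip_type="time_uniform",
        method="multistep",       # order-2 multistep DPM-Solver++ == DPM++2M
    )
```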

tl;dr: past samplers were fake; DPM++2M is now experimental but real; DDIM + cond_free is preferable for steps < 20 until better samplers exist.

Consequently, I'm making DDIM the default sampler for ultra_fast for now, and have created a new preset (very_fast) that uses DPM++2M with more steps.

All claims stated here only apply to fp32 inference; I have no idea what the results are like on --half yet.