Diffusion loss not decreasing
aniketp02 opened this issue
Hi,
I have trained the GradTTS model on an Indian-accented English dataset, and the results are pretty awesome.
Looking at the logs, I was startled to see that, unlike the other losses, the diffusion loss did not decrease over training and also fluctuated a lot. Can anyone explain why this is the case, and if the diffusion loss fluctuates so much, why is it used in the total loss calculation?
I have attached my TensorBoard outputs.
@aniketp02 Hi! All three losses are necessary to train the model properly. What you observed about the diffusion loss is normal behaviour, which we discuss in Section 4 of our Grad-TTS paper.
The denoising score matching objective we want to minimize to train a diffusion model is the integral (reconstructed here in the notation of Section 4 of the Grad-TTS paper)

$$\mathcal{L}_{\text{diff}} = \int_0^T \lambda_t\, \mathbb{E}_{X_0,\, \xi_t} \left\| s_\theta(X_t, \mu, t) + \lambda_t^{-1} \xi_t \right\|_2^2 \, dt,$$

where $\xi_t \sim \mathcal{N}(0, \lambda_t I)$ and $\lambda_t = 1 - e^{-\int_0^t \beta_s \, ds}$. In practice this integral is approximated stochastically: for each training batch we sample $t$ uniformly at random, so each logged value estimates the integrand at a single random noise level. Moreover, the optimal score can only predict the conditional expectation of the injected noise $\xi_t$, not the sampled noise itself, so even a perfect model has a non-zero loss floor.
Combining these facts, we get such diffusion loss behaviour. Nonetheless, that doesn't mean it is unnecessary to optimize; on the contrary, it is crucial. Otherwise, the Grad-TTS diffusion decoder would produce just noise. Finally, the diffusion loss does its job well if we check the energy function it corresponds to. Look at this issue: #9.
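To see why the logged value jumps around, here is a minimal PyTorch sketch of this kind of denoising score matching loss. The function name, signature, and the linear beta schedule are my own illustrative assumptions, not the repo's actual API; the point is only that $t$ and $\xi_t$ are resampled every step, so the loss fluctuates even for a fixed model:

```python
import torch

def diffusion_loss(score_model, x0, mu, beta0=0.05, beta1=20.0):
    """One Monte-Carlo estimate of a Grad-TTS style DSM objective.

    Hypothetical sketch: names and the linear beta schedule are
    illustrative assumptions, not the repository's actual code.
    """
    b = x0.shape[0]
    # t ~ U(0, 1): a single random noise level per batch item
    t = torch.rand(b).clamp(min=1e-5)
    cum = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2   # \int_0^t beta_s ds
    shape = (-1,) + (1,) * (x0.dim() - 1)
    lam = (1.0 - torch.exp(-cum)).view(shape)          # lambda_t
    decay = torch.exp(-0.5 * cum).view(shape)
    rho = x0 * decay + mu * (1.0 - decay)              # mean of X_t given X_0
    xi = lam.sqrt() * torch.randn_like(x0)             # xi_t ~ N(0, lambda_t I)
    xt = rho + xi
    s = score_model(xt, mu, t)
    # lambda_t-weighted score-matching error; its floor is non-zero,
    # since no model can recover the exact sampled noise xi_t from xt
    return (lam * (s + xi / lam) ** 2).mean()
```

Even with the model frozen, repeated calls return different values because both the noise level $t$ and the noise $\xi_t$ are redrawn each time; that sampling variance is exactly the fluctuation you see in TensorBoard.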