archinetai / audio-diffusion-pytorch

Audio generation using diffusion models, in PyTorch.

NaN after training for a while

jameshball opened this issue

Hi!

I'm having an issue training with the basic model provided in the README. After training on the LibriSpeech dataset for about 20 epochs, I start getting NaN losses returned from the model, and when sampling and saving to a file I sometimes just get silent audio.

I had a go at debugging but couldn't really pin down the issue, other than that the first NaN I could find in the forward pass was in the input to the ResNet block. Not sure if this is helpful, but I've added my debug output here: output.log. The prints were just me checking in various forward functions whether the inputs and outputs were NaN; I didn't isolate any specific lines.
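
In case it helps anyone reproduce the debugging, something like the following hook-based check (a generic PyTorch sketch, not part of this library) does the same job as my prints without editing the library code - it flags the first module whose output goes NaN:

```python
import torch

def add_nan_hooks(model: torch.nn.Module) -> None:
    """Register forward hooks that report any module whose output contains NaN."""
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output) and torch.isnan(output).any():
                print(f"NaN in output of {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call add_nan_hooks(model) once before training or sampling.
# torch.autograd.set_detect_anomaly(True) can additionally localise NaNs in the backward pass.
```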

My training script is pretty short and I don't think I'm doing anything particularly weird that would cause this! You can have a look here: https://github.com/jameshball/audio-diffusion/blob/master/train.py

Also, here's a snippet of my training output where the loss turns to NaN: nan.txt

I should also be able to follow up with a Google Drive link to download the checkpoint so you can test it more easily - you might need to modify train.py to remove some wandb calls and just load the checkpoint from disk, but that should be straightforward. Alternatively, I also get NaNs when just sampling from the model here: https://github.com/jameshball/audio-diffusion/blob/master/sample.py
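
Before sampling, it's also worth ruling out corrupted weights with a quick scan of the saved checkpoint for non-finite values (a sketch, assuming a plain state_dict-style checkpoint; the path is a placeholder):

```python
import torch

# Placeholder path; point this at the actual checkpoint file.
state = torch.load("checkpoint.pt", map_location="cpu")
state_dict = state.get("model", state) if isinstance(state, dict) else state

for name, tensor in state_dict.items():
    if torch.is_tensor(tensor) and tensor.is_floating_point():
        if torch.isnan(tensor).any() or torch.isinf(tensor).any():
            print(f"{name}: contains NaN/Inf")
```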

Please let me know if there's anything I can help with as this would be great to fix!

James

Closing as I haven't experienced the issue since and have also made this repo private. Thanks for the library!!

Hi @jameshball , thanks for sharing the code! I wonder if you changed anything when you fixed this issue.

I didn't change anything :/ I just reran training and haven't experienced it since, so this could probably be reopened if you've experienced it too.

Oh cool, thanks for clarifying!

Did anyone else get this NaN loss? Do either of you know how many iterations it took for you to get this? It happened for me at 840 iterations. I tried to go back and restart training from the last checkpoint before this, but ended up with NaN again. Was anyone who had this issue able to restart from an earlier checkpoint and move past it?

**Update:** I went back to earlier checkpoints and ended up at NaN at exactly the same place on several occasions. I am winding back further, but this may mean there is an error introduced into the model early on that does not express itself until a number of iterations later.
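
For anyone hitting the same wall: two common guards against this failure mode are clipping gradients and skipping updates when the loss goes non-finite. Not a confirmed fix, just a minimal sketch of a defensive training step, assuming the README-style API where calling the model returns the diffusion loss directly:

```python
import torch

# `model`, `optimizer` and `audio` are placeholders for the objects in your training script.
loss = model(audio)  # audio-diffusion-pytorch models return the diffusion loss

if torch.isfinite(loss):
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Clip gradients to guard against an occasional exploding step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
else:
    # Skip the update so a single bad batch cannot poison the weights.
    optimizer.zero_grad(set_to_none=True)
```

This doesn't explain why the NaN appears at exactly the same iteration (which looks more like a specific bad batch or an accumulating instability), but it at least keeps the weights finite.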

@fred-dev What was the quality of the samples from the model that produced the NaN loss? Did it generate reasonable speech data?

@0417keito I was not using speech, but a general waveform dataset. The generation was OK, though not nearly as good as the published results. I never worked out how to actually avoid the NaN loss. I wonder if anyone outside of the publishers has used this successfully?

Hey @fred-dev! I'm also having this NaN loss problem. Did you manage to solve it?

Nope. I got a bit further using the version of this repo that matches the publication. In the end I switched to working with stable-audio-tools.