archinetai / audio-diffusion-pytorch

Audio generation using diffusion models, in PyTorch.

NaN after training for a while

jameshball opened this issue

Hi!

I'm having an issue training with the basic model provided in the README. After training on the LibriSpeech dataset for about 20 epochs, I start getting NaN losses returned from the model, and when sampling and saving to a file I sometimes just get silent audio.

I had a go at debugging but couldn't really pin down the issue, other than that the first NaN I could find in the forward pass was in the input to the ResNet block. Not sure if this is helpful, but I've added my debug output here: output.log. The prints were just me checking in various forward functions whether the inputs and outputs were NaN; I didn't isolate any specific lines.
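
In case it helps anyone reproduce the debugging, something like the following hook-based check (a generic PyTorch sketch, not part of this library) does the same job as my prints without editing the library code - it flags the first module whose output goes NaN:

```python
import torch

def add_nan_hooks(model: torch.nn.Module) -> None:
    """Register forward hooks that report any module whose output contains NaN."""
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output) and torch.isnan(output).any():
                print(f"NaN in output of {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call add_nan_hooks(model) once before training or sampling.
# torch.autograd.set_detect_anomaly(True) can additionally localise NaNs in the backward pass.
```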

My training script is pretty short and I don't think I'm doing anything particularly weird that would cause this! You can have a look here: https://github.com/jameshball/audio-diffusion/blob/master/train.py

Also, here's a snippet of my training output where the loss turns to NaN: nan.txt

I should also be able to follow up with a Google Drive link to download the checkpoint so you can test it more easily - you might need to modify train.py to remove some wandb calls and just load the checkpoint from disk, but that should be straightforward. Alternatively, I also get NaNs when just sampling from the model here: https://github.com/jameshball/audio-diffusion/blob/master/sample.py
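
Before sampling, it's also worth ruling out corrupted weights with a quick scan of the saved checkpoint for non-finite values (a sketch, assuming a plain state_dict-style checkpoint; the path is a placeholder):

```python
import torch

# Placeholder path; point this at the actual checkpoint file.
state = torch.load("checkpoint.pt", map_location="cpu")
state_dict = state.get("model", state) if isinstance(state, dict) else state

for name, tensor in state_dict.items():
    if torch.is_tensor(tensor) and tensor.is_floating_point():
        if torch.isnan(tensor).any() or torch.isinf(tensor).any():
            print(f"{name}: contains NaN/Inf")
```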

Please let me know if there's anything I can help with as this would be great to fix!

James

Closing as I haven't experienced the issue since and have also made this repo private. Thanks for the library!!

Hi @jameshball , thanks for sharing the code! I wonder if you changed anything when you fixed this issue.

I didn't change anything :/ I just reran training and haven't experienced it since, so this could probably be reopened if you've experienced it too.

Oh cool, thanks for clarifying!

Did anyone else get this NaN loss? Do either of you know how many iterations it took for you to get this? It happened for me at 840 iterations. I tried to go back and restart training from the last checkpoint before this, but ended up with NaN again. Was anyone who had this issue able to restart from an earlier checkpoint and move past it?

**Update:** I went back to earlier checkpoints and ended up at NaN at exactly the same place on several occasions. I am winding back further, but this may mean there is an error introduced into the model early on that does not express itself until a number of iterations later.
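
For anyone hitting the same wall: two common guards against this failure mode are clipping gradients and skipping updates when the loss goes non-finite. Not a confirmed fix, just a minimal sketch of a defensive training step, assuming the README-style API where calling the model returns the diffusion loss directly:

```python
import torch

# `model`, `optimizer` and `audio` are placeholders for the objects in your training script.
loss = model(audio)  # audio-diffusion-pytorch models return the diffusion loss

if torch.isfinite(loss):
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Clip gradients to guard against an occasional exploding step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
else:
    # Skip the update so a single bad batch cannot poison the weights.
    optimizer.zero_grad(set_to_none=True)
```

This doesn't explain why the NaN appears at exactly the same iteration (which looks more like a specific bad batch or an accumulating instability), but it at least keeps the weights finite.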

@fred-dev What was the quality of the samples from the model that produced the NaN loss? Did it generate reasonable speech data?

@0417keito I was not using speech, but a general waveform dataset. The generation was OK, though not nearly as good as the published results. I never worked out how to actually avoid the NaN loss. I wonder if anyone outside of the publishers has used this successfully?

Hey @fred-dev! I'm also having this NaN loss problem. Did you manage to solve it?

Nope. I got a bit further using the version of this repo that matches the publication. In the end I switched to working with stable-audio-tools.