NaN issue

Question

CaiwenXu opened this issue a year ago · comments

Hi, many thanks for your excellent work! I have a problem when training the VQ GAN: the loss suddenly becomes NaN. Do you know why this happens? I used the LIDC dataset.

benearnthof · Answer 1 · Tue Jul 25 2023 19:16:18 GMT+0800 (China Standard Time)

I'm currently having the same problem. I used the exact same configs provided here and still had no luck; training is very unstable.
The model also suffers from mode collapse once the discriminator starts training.

benearnthof · Answer 2 · Tue Jul 25 2023 19:24:56 GMT+0800 (China Standard Time)

I believe this problem may stem from the accumulate_grad_batches parameter. I successfully trained a run for more than 50000 steps, but trying to replicate that training with accumulate_grad_batches > 1 runs into the NaN problem. @CWX-student, can you confirm this, or do you have any other info on your end?
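For readers unfamiliar with the setting, here is a minimal pure-Python sketch of what gradient accumulation does in general: gradients from k micro-batches are scaled by 1/k and summed before a single optimizer step. It is a toy one-parameter model, not this repository's training loop, and the names `grad_mse` and `accumulated_grad` are made up for illustration.

```python
import math

def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Sum of micro-batch gradients, each scaled by 1/k (gradient accumulation)."""
    k = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        total += grad_mse(w, micro_batch) / k
    return total

# Hypothetical toy data: (x, y) pairs.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]  # two equally sized micro-batches

full_grad = grad_mse(0.5, data)
acc_grad = accumulated_grad(0.5, micro_batches)
print(full_grad, acc_grad)
```

With equally sized micro-batches the accumulated gradient matches the full-batch gradient, so in exact arithmetic the setting should only change the effective batch size. That suggests, without proving it, that the instability seen with accumulate_grad_batches > 1 comes from numerical precision rather than from the accumulation itself.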

benearnthof · Answer 3 · Tue Jul 25 2023 20:02:33 GMT+0800 (China Standard Time)

Update: Setting the precision parameter in the config to at least 32 seems to alleviate this problem. See https://discuss.pytorch.org/t/distributed-training-gives-nan-loss-but-single-gpu-training-is-fine/63664/6
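One plausible reason 16-bit precision can produce NaN, shown as a standard-library sketch rather than a diagnosis of this particular model: the largest finite half-precision value is 65504, so a moderately large intermediate overflows to infinity, and a later operation such as inf - inf yields NaN. The helper `to_fp16` is a made-up name that emulates half-precision rounding with the `struct` module.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def to_fp16(x):
    """Round x to half precision; values outside the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack out-of-range values; hardware would give inf.
        return math.copysign(math.inf, x)

activation = to_fp16(300.0)                # representable in fp16
squared = to_fp16(activation * activation) # 90000 exceeds FP16_MAX
difference = squared - squared             # inf - inf

print(activation, squared, difference)
print(math.isnan(difference))
```

The same computation in 32-bit floats stays finite, which is consistent with the observation that a precision of at least 32 alleviates the problem. If the config's precision value is passed through to PyTorch Lightning's Trainer (an assumption about this project), the corresponding argument is `precision=32`.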