pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

About the step_loss == nan

maobenz opened this issue

Hello,
Thanks for your brilliant work!
When I run the code, I find that the step loss always equals NaN when I use the BDD dataset. After carefully checking the code, I found that the output of the last block of the upsample_block is NaN. I just use the fp16 model and follow the pipeline.
Could anyone tell me what the reason is?
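For reference, I located the failing block with generic forward hooks along these lines (a minimal sketch, not code from this repo; `add_nan_hooks` is a hypothetical helper name):

```python
import torch

def add_nan_hooks(model):
    """Register forward hooks that report any module whose
    output contains NaN or inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                print(f"non-finite output in {name} ({type(module).__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# e.g. call add_nan_hooks(unet) once before the forward pass
```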

Thanks a lot!

Could you please provide some more details, such as your specific settings, device information, and so on?

Thanks a lot!

I tried different resolutions of BDD images, but the step_loss is always NaN. I just use one video clip from BDD and split the video into frames to feed into the model. I have tried an RTX 3090 and an A100.

When I use the fp32 model, the step loss is not NaN, but the fp16 model's loss is still NaN. In the last block of upsample_block, query @ key.transpose(-1, -2) becomes so large that it produces NaN.
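A toy repro of the overflow (made-up magnitudes, just to show the fp16 limit; not the actual SVD activations):

```python
import torch

# float16 can only represent magnitudes up to ~65504
print(torch.finfo(torch.float16).max)  # 65504.0

# toy query/key with large activations, head_dim = 64
q = torch.full((1, 8, 64), 60.0)
k = torch.full((1, 8, 64), 60.0)

# each attention logit is 60 * 60 * 64 = 230400, which overflows fp16 to inf
scores = (q @ k.transpose(-1, -2)).to(torch.float16)
print(scores.isinf().all())  # tensor(True)

# softmax over an all-inf row computes inf - inf = NaN
print(scores.float().softmax(dim=-1).isnan().all())  # tensor(True)
```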

My model id is "stabilityai/stable-video-diffusion-img2vid-xt", but when I tried other model ids, it also didn't work.

My torch version is 1.13.1+cu116 and my diffusers version is 0.25.0. Even if I feed all zeros as input, the loss is still NaN.

OK, I have found the issue: the torch version should be 2.0.1 rather than 1.13.1. After changing the PyTorch version, the problem was solved.
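For anyone else hitting this, one thing worth checking (a sketch, assuming diffusers 0.25; I can't say for sure this is the cause) is whether diffusers routes attention through PyTorch 2's fused scaled_dot_product_attention instead of the eager query @ key.transpose(-1, -2) path:

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel
from diffusers.models.attention_processor import AttnProcessor2_0

# torch >= 2.0 exposes the fused SDPA kernel
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    subfolder="unet",
    torch_dtype=torch.float16,
)
# diffusers selects AttnProcessor2_0 automatically when SDPA is available,
# avoiding the explicit fp16 score matrix of the eager attention path
print(all(isinstance(p, AttnProcessor2_0) for p in unet.attn_processors.values()))
```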

Ah, I see. In fact, though, I'm not sure I can explain why the PyTorch version change makes a difference here. 😢

I upgraded PyTorch to 2.1.2 but I still have this problem; I can only train in bf16. Any solutions?
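For context, I assume bf16 survives because it keeps float32's 8-bit exponent range, so the same attention logits that overflow fp16 stay finite:

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, same exponent range as fp32

logit = torch.tensor(230400.0)
print(logit.to(torch.float16))   # inf -> later becomes NaN in softmax
print(logit.to(torch.bfloat16))  # 230400. (coarser precision, but finite)
```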