pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

About the step_loss == nan

maobenz opened this issue

Hello,
Thanks for your brilliant work!
When I run the code, I find that the step loss always equals NaN when I use the BDD dataset. After carefully checking the code, I found that the output of the last block of the upsample_block is NaN. I just use the fp16 model and follow the pipeline.
Could anyone tell me what the reason is?
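For reference, I located the failing block with generic forward hooks along these lines (a minimal sketch, not code from this repo; `add_nan_hooks` is a hypothetical helper name):

```python
import torch

def add_nan_hooks(model):
    """Register forward hooks that report any module whose
    output contains NaN or inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                print(f"non-finite output in {name} ({type(module).__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# e.g. call add_nan_hooks(unet) once before the forward pass
```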

Thanks a lot!

Could you please provide some more details, such as your specific settings, device information, and so on?

Thanks a lot!

I tried different resolutions of BDD images, but the step_loss is always NaN. I just use one video clip from BDD and split the video into frames to feed into the model. I have tried an RTX 3090 and an A100.

When I use the fp32 model, the step loss is not NaN, but the fp16 model's loss is still NaN. In the last block of upsample_block, query @ key.transpose(-1, -2) becomes so large that it produces NaN.
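A toy repro of the overflow (made-up magnitudes, just to show the fp16 limit; not the actual SVD activations):

```python
import torch

# float16 can only represent magnitudes up to ~65504
print(torch.finfo(torch.float16).max)  # 65504.0

# toy query/key with large activations, head_dim = 64
q = torch.full((1, 8, 64), 60.0)
k = torch.full((1, 8, 64), 60.0)

# each attention logit is 60 * 60 * 64 = 230400, which overflows fp16 to inf
scores = (q @ k.transpose(-1, -2)).to(torch.float16)
print(scores.isinf().all())  # tensor(True)

# softmax over an all-inf row computes inf - inf = NaN
print(scores.float().softmax(dim=-1).isnan().all())  # tensor(True)
```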

My model id is "stabilityai/stable-video-diffusion-img2vid-xt", but when I tried other model ids, it also didn't work.

My torch version is 1.13.1+cu116 and my diffusers version is 0.25.0. Even if I feed all zeros as input, the loss is still NaN.

OK, I have found the issue: the torch version should be 2.0.1 rather than 1.13.1. After changing the PyTorch version, the problem was solved.
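For anyone else hitting this, one thing worth checking (a sketch, assuming diffusers 0.25; I can't say for sure this is the cause) is whether diffusers routes attention through PyTorch 2's fused scaled_dot_product_attention instead of the eager query @ key.transpose(-1, -2) path:

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel
from diffusers.models.attention_processor import AttnProcessor2_0

# torch >= 2.0 exposes the fused SDPA kernel
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    subfolder="unet",
    torch_dtype=torch.float16,
)
# diffusers selects AttnProcessor2_0 automatically when SDPA is available,
# avoiding the explicit fp16 score matrix of the eager attention path
print(all(isinstance(p, AttnProcessor2_0) for p in unet.attn_processors.values()))
```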

Ah, I see. In fact, though, I'm not sure I can explain why the PyTorch version change makes a difference here. 😢

I upgraded PyTorch to 2.1.2 but I still have this problem; I can only train in bf16. Any solutions?
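For context, I assume bf16 survives because it keeps float32's 8-bit exponent range, so the same attention logits that overflow fp16 stay finite:

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, same exponent range as fp32

logit = torch.tensor(230400.0)
print(logit.to(torch.float16))   # inf -> later becomes NaN in softmax
print(logit.to(torch.bfloat16))  # 230400. (coarser precision, but finite)
```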