alexandrosstergiou / SoftPool

[ICCV 2021] Code for approximated exponential maximum pooling


NaN values found in tensor

PJJie opened this issue


When I replace multiple MaxPool layers with SoftPool, I find NaN values in the tensor.

Returned NaN values are quite common with CUDA code, as it is low-level and does not include any internal checks for numerical overflow or underflow. PyTorch itself has a range of functions (e.g. torch.nan_to_num()) to deal with such cases; simply wrapping your output with one of them should alleviate the issue.
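For illustration, a minimal sketch of that wrapping, assuming the `SoftPool2d` module from this repo (the exact import path and pooling parameters here are placeholders; note that torch.nan_to_num() requires torch >= 1.8):

```python
import torch
from SoftPool import SoftPool2d  # import path assumed from this repo's README

# Illustrative pooling layer; kernel/stride values are arbitrary.
pool = SoftPool2d(kernel_size=2, stride=2).cuda()

x = torch.randn(1, 3, 32, 32, device="cuda")
y = pool(x)

# Replace any NaN values the CUDA kernel may produce with a finite value.
y = torch.nan_to_num(y, nan=0.0)
```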

I am also planning to include this in upcoming commits to the repo.

Best,
Alex

Hi @alexandrosstergiou, I would like to know whether this bug has been fixed, or whether there has been any progress. I'm also using SoftPool in a project and I don't have this problem myself, but other users of my project do: haomo-ai/MotionSeg3D#6

Hi @MaxChanger. Most NaN-value problems in forward/backward calls have been fixed since torch 1.6, when torch.cuda.amp was integrated along with its decorators for custom functions. After commit f49fd84, I had stable runs in both full- and mixed-precision settings across different GPUs, environments, and configurations. Since then I have not noticed any NaN values occurring while training in other projects.
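For reference, this is roughly what the torch.cuda.amp decorators mentioned above look like when applied to a custom autograd function. This is only a sketch: `StableSoftPoolFn` and its identity forward/backward are placeholders, not the repo's actual CUDA op.

```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class StableSoftPoolFn(torch.autograd.Function):
    # Placeholder for a custom op; the real SoftPool forward/backward
    # dispatch to CUDA kernels instead of the identity used here.
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # run in fp32 even under autocast
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x

    @staticmethod
    @custom_bwd  # gradients arrive in the dtype matching the forward pass
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output
```

Casting inputs to float32 inside the custom op is the usual way to avoid fp16 overflow in the exponential weighting while still benefiting from mixed precision elsewhere.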

Perhaps it would be worth suggesting that anyone opening an issue in your project first re-install the latest version of SoftPool and ensure that they are using torch >= 1.7 (preferably the latest release), just to be sure?
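A quick sanity check along those lines that users could run before filing an issue (a sketch; the 1.7 threshold comes from the comment above):

```python
import torch

# Parse e.g. "1.13.1+cu117" -> (1, 13), ignoring any local build suffix.
major, minor = (int(v) for v in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (1, 7), (
    f"torch {torch.__version__} detected; SoftPool's NaN fixes assume torch >= 1.7"
)
```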

Hi @alexandrosstergiou. Thank you for your kind reply. I have run nearly a hundred experiments on four or five different GPU servers and have not encountered this issue (NaN) either, so I considered your project robust enough.
After your confirmation I am more at ease, and I will also work with the other users to confirm the issue.