Invalid training loss

Question

hshiah opened this issue a year ago · comments

With the new version, the training loss on BraTS2020 is usually NaN.

Shi Haochen · Answer 1 · Sat Feb 11 2023 12:44:51 GMT+0800 (China Standard Time)

When the loss is not NaN, grad_norm is extremely large (for example 7.44e+04), whereas with the previous version it was usually around 10.
May I ask what causes this? I am training the model on the raw BraTS2020 training data.
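
For anyone who wants to catch this kind of symptom early, here is a minimal sketch of a per-step sanity check. It is not taken from the MedSegDiff code: `global_grad_norm`, `check_step`, and the `max_norm` threshold of 100 are hypothetical names and values chosen for illustration, and gradients are assumed to be available as plain lists of floats, one list per parameter.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values, given one flat list per parameter."""
    return math.sqrt(sum(g * g for param in grads for g in param))


def check_step(loss, grads, max_norm=100.0):
    """Return a list of warnings for one training step (hypothetical helper)."""
    problems = []
    # A NaN or infinite loss makes every later update meaningless.
    if not math.isfinite(loss):
        problems.append("non-finite loss")
    norm = global_grad_norm(grads)
    if not math.isfinite(norm):
        problems.append("non-finite gradient norm")
    elif norm > max_norm:
        # max_norm is an assumed threshold, not a value from the project.
        problems.append(f"gradient norm {norm:.3g} exceeds {max_norm:g}")
    return problems
```

Calling `check_step(loss, grads)` after each backward pass and stopping on the first non-empty result makes it easier to tell which step first went wrong.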

Junde Wu · Answer 2 · Sat Feb 11 2023 15:57:59 GMT+0800 (China Standard Time)

I fixed the bug. Please update the project and try again.

Shi Haochen · Answer 3 · Mon Feb 13 2023 00:30:24 GMT+0800 (China Standard Time)

Hi, I tried the newest version and the model gets stuck at the training stage. I checked the GPU memory usage, and it stays at a small value (around 2500 MiB) instead of the normal value.

Junde Wu · Answer 4 · Mon Feb 13 2023 10:02:50 GMT+0800 (China Standard Time)

@hshiah I checked it again, and it works fine in my environment. Did you run it on the GPU? You may need to add --gpu 0.

Shi Haochen · Answer 5 · Mon Feb 13 2023 11:10:05 GMT+0800 (China Standard Time)

I tried it again and it works. Sorry for my mistake.