Training won't converge.

Question

yashkant opened this issue 2 years ago

Yash Kant commented 2 years ago

Hi @xuchen-ethz!

Thanks so much for releasing the code!

I quickly tested it with the command python train.py subject=50002 and saw a really great speed-up!

Unfortunately, the loss did not converge to a reasonable value. Do you have any idea what could have gone wrong?

Appreciate your help!

xuchen-ethz · Answer 1 · Tue Dec 13 2022 18:44:24 GMT+0800 (China Standard Time)

Hi @yashkant

I did a quick test and got a training log that looks normal.

One possible reason for the different behavior could be the GPU model. In our previous project SNARF (the slow version of this repo), we observed numerical instability on 3080/3090. I haven't tested Fast-SNARF on 3080/3090 yet. Are you maybe using 3080/3090?
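
A quick way to answer the GPU question is to print the device name from the training environment. The following is only a minimal sketch, assuming the environment uses PyTorch with CUDA (the thread does not say so explicitly). The helper names gpu_name and is_flagged_gpu are hypothetical and not part of this repo, and the list of flagged models simply repeats the two cards mentioned above.

```python
import importlib.util
from typing import Optional

# GPU models mentioned in this thread as showing numerical instability
# with the older SNARF code. This list is illustrative, not exhaustive.
FLAGGED_MODELS = ("3080", "3090")


def is_flagged_gpu(name: str) -> bool:
    """Return True if the device name contains one of the flagged models."""
    return any(model in name for model in FLAGGED_MODELS)


def gpu_name() -> Optional[str]:
    """Return the name of CUDA device 0, or None if it cannot be queried."""
    # Assumption: PyTorch is the framework in use. If it is not installed,
    # report None instead of raising.
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(0)


if __name__ == "__main__":
    name = gpu_name()
    if name is None:
        print("No CUDA device could be queried.")
    else:
        print(f"GPU: {name} (flagged in this thread: {is_flagged_gpu(name)})")
```

Running this before training and pasting the printed line into the issue makes it easier to tell whether the hardware matches the cases described above.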

Yash Kant · Answer 2 · Tue Dec 13 2022 22:54:18 GMT+0800 (China Standard Time)

Thanks for the quick response!

I recently switched to 32GB V100, and I had the old SNARF code working on a 16GB V100.

I am planning to rerun the old SNARF code on the newer 32GB gpu to check if that works.

Will keep you posted!

xuchen-ethz · Answer 3 · Thu Jan 12 2023 20:13:57 GMT+0800 (China Standard Time)

Hi @yashkant

While debugging another issue, I realized that two parts of the code do not belong to this version, and they caused the convergence problem you experienced earlier. I was testing on a different branch, so I did not catch this problem.

Really sorry for the problem and for the time I might have wasted! This is now fixed, and training should work normally with the updated code.

Best Regards,
Xu

Yash Kant · Answer 4 · Fri Jan 13 2023 00:46:58 GMT+0800 (China Standard Time)

Thanks @xuchen-ethz, really appreciate the fix!

The code works for me now! :)

xuchen-ethz · Answer 5 · Mon Jan 16 2023 06:11:39 GMT+0800 (China Standard Time)

great to know! thanks for the feedback