Training won't converge.
yashkant opened this issue · comments
Hi @xuchen-ethz!
Thanks so much for releasing the code!
I quickly tested this using the command `python train.py subject=50002` and found a really great speed-up!
But unfortunately, the loss did not converge to a reasonable value. Do you have any idea what could have gone wrong?
Appreciate your help!
Hi @yashkant
I did a quick test and got the following training log which looks normal:
One possible reason for the different behavior could be the GPU model. In our previous project SNARF (the slow version of this repo), we observed numerical instability on the 3080/3090. I haven't tested Fast-SNARF on the 3080/3090 yet. Are you by any chance using a 3080/3090?
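A quick way to check for the kind of numerical instability described above is to watch for the loss turning NaN/Inf during training. The snippet below is a minimal, hypothetical sketch (not part of the SNARF/Fast-SNARF codebase); `loss_is_unstable` is an illustrative helper name.

```python
import math

# Hypothetical helper (not from the repo): flag a loss value that has
# gone NaN or Inf, a common symptom of GPU-dependent numerical
# instability like the one reported here.
def loss_is_unstable(loss_value: float) -> bool:
    return math.isnan(loss_value) or math.isinf(loss_value)

# Example: scanning a short loss history from a diverging run.
losses = [0.92, 0.71, 0.55, float("nan")]
unstable = any(loss_is_unstable(l) for l in losses)
print(unstable)  # True
```

Logging the GPU model alongside such a check (e.g. via `torch.cuda.get_device_name(0)`) makes it easier to correlate divergence with specific hardware.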
Thanks for the quick response!
I recently switched to a 32GB V100, and I had the old SNARF code working on a 16GB V100.
I am planning to rerun the old SNARF code on the newer 32GB GPU to check whether that works.
Will keep you posted!
Hi @yashkant
While debugging another issue, I realized that two parts of the code do not belong to this version, and they cause the convergence problem you experienced earlier. I was testing on a different branch, so I did not catch this problem.
Really sorry for the problem and for any of your time it may have wasted! Things are fixed now, and training should work normally with the updated code.
Best Regards,
Xu
Thanks @xuchen-ethz, really appreciate the fix!
The code works for me now! :)
Great to know! Thanks for the feedback.