xuchen-ethz / fast-snarf

Training won't converge.

yashkant opened this issue

Hi @xuchen-ethz!

Thanks so much for releasing the code!

I quickly tested this using the command python train.py subject=50002 and saw a really great speed-up!

Unfortunately, the loss did not converge to a reasonable value. Do you have any idea what could have gone wrong?

Appreciate your help!

image

Hi @yashkant

I did a quick test and got the following training log which looks normal:
image

One possible reason for the different behavior could be the GPU model. In our previous project SNARF (the slow version of this repo), we observed numerical instability on 3080/3090. I haven't tested Fast-SNARF on 3080/3090 yet. Are you maybe using 3080/3090?
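To check which GPU PyTorch is actually running on, something like the following sketch could help. It is only an illustration: the helper names `is_suspect_gpu` and `current_gpu_name` and the `SUSPECT_MODELS` list are hypothetical and not part of this repo, and the list simply mirrors the 3080/3090 models mentioned above.

```python
import re

# Hypothetical list, taken from the comment above rather than from the repo.
SUSPECT_MODELS = ("3080", "3090")


def is_suspect_gpu(device_name: str) -> bool:
    """Return True if the device name mentions one of the suspect models."""
    return any(re.search(rf"\b{m}\b", device_name) for m in SUSPECT_MODELS)


def current_gpu_name():
    """Return the name of CUDA device 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(0)


if __name__ == "__main__":
    name = current_gpu_name()
    if name is None:
        print("No CUDA device visible to PyTorch")
    elif is_suspect_gpu(name):
        print(f"{name}: one of the models where instability was observed with SNARF")
    else:
        print(f"{name}: not in the suspect list")
```

A match here would only be a hint, since the instability was observed with the older SNARF code and not confirmed for Fast-SNARF.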

Thanks for the quick response!

I recently switched to a 32GB V100, and I had the old SNARF code working on a 16GB V100.

I am planning to rerun the old SNARF code on the newer 32GB GPU to check if that works.

Will keep you posted!

Hi @yashkant

While debugging another issue, I realized that two parts of the code do not belong to this version, and they caused the convergence problem you experienced earlier. I was testing on a different branch, so I did not catch this problem.

Really sorry for the problem and for any of your time I may have wasted! This is now fixed, and training should work normally with the updated code.

Best Regards,
Xu

Thanks @xuchen-ethz, really appreciate the fix!

The code works for me now! :)

great to know! thanks for the feedback