Training issues
JACKYLUO1991 opened this issue · comments
The training process often shows 100% GPU utilization while the code gets stuck... Also, the model renders hands very poorly; is there a better suggestion?
For the stuck problem, please make sure that you have sufficient GPU memory and a compatible environment. We use 4 NVIDIA V100 GPUs (32 GB) for training in all our experiments. If you can provide some information about where or how the code gets stuck, perhaps we can help you analyze this problem. We are also working on improving this codebase for better efficiency.
Currently, since we use SMPL as a shape prior, which doesn't model hand articulation explicitly, it is still challenging for ELICIT to recover the actual geometry of hands. We will try to address this problem in our future work.
Here is the output of nvidia-smi while the program is stuck. This happens every time I run it.
Could you please tell us which line of code the program is stuck on? We have not encountered a problem like this. It seems that the program gets stuck in a DataParallel component (cnl_mlp) which only runs on the secondary GPUs (1~3).
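To find the exact line a stuck run is blocked on, one generic option (my suggestion, not part of the ELICIT codebase) is Python's built-in `faulthandler`, which can dump every thread's stack without killing the process:

```python
import faulthandler
import signal

# Generic hang-debugging sketch (not specific to this repo).

# Dump every thread's Python stack to stderr when the process receives
# SIGUSR1, so a stuck trainer can be inspected from another shell with
# `kill -USR1 <pid>` without terminating it.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump the stacks automatically whenever the process
# keeps running past the timeout (in seconds), repeating each interval.
faulthandler.dump_traceback_later(600, repeat=True)
```

The dumped frames should show whether the workers are blocked inside cnl_mlp or elsewhere. `py-spy dump --pid <pid>` is another option that needs no code changes at all.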
Hi,
I have the same problem. The training is stuck at Epoch 1, Iter 5420, and the GPU status is the same as @JACKYLUO1991's. Have you solved the problem?
Hi @JACKYLUO1991, according to feedback from @hengfei-wang, the custom op grid_sample_3d causes the stuck problem. Please pull this commit and check whether it works for you. Again, we thank @hengfei-wang for his feedback!
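For anyone who wants to rule out the custom op entirely, a possible workaround (my assumption, not something the authors endorse; padding and interpolation behavior may differ from the custom kernel) is PyTorch's built-in `F.grid_sample`, which performs volumetric sampling when given 5-D tensors:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: a small feature volume and a query grid
# with coordinates normalized to [-1, 1], as grid_sample expects.
volume = torch.randn(1, 4, 8, 8, 8)        # (N, C, D, H, W)
grid = torch.rand(1, 6, 6, 6, 3) * 2 - 1   # (N, D', H', W', 3)

# Trilinear sampling of the volume at the grid locations.
sampled = F.grid_sample(volume, grid, align_corners=True)
print(sampled.shape)  # torch.Size([1, 4, 6, 6, 6])
```

If training runs to completion with the built-in op swapped in, that would confirm the custom kernel as the culprit.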
@huangyangyi It seems that this is not the cause; the problem above still occurs.
@hengfei-wang Not really. Did you find a solution? Could it be a GPU architecture problem? The author said he was using V100s.
Hi, @JACKYLUO1991
I can train successfully after revising the code as the author suggested. As for your problem, I don't know the cause.