Training issues
JACKYLUO1991 opened this issue · comments
The training process often shows 100% GPU utilization while the code gets stuck... Also, the model renders hands very poorly; is there a better suggestion?
For the stuck problem, please make sure that you have sufficient GPU memory and a compatible environment. We use 4 NVIDIA V100 GPUs (32 GB) for training in all our experiments. If you can provide some information about where or how the code gets stuck, perhaps we can help you analyze this problem. We are also working on improving this codebase for better efficiency.
Currently, since we use SMPL as a shape prior, which doesn't model hand articulation explicitly, it is still challenging for ELICIT to recover the actual geometry of hands. We will try to address this problem in our future work.
Here is the output of nvidia-smi while the program is stuck. This happens every time I run it.
Could you please tell us which line of code the program is stuck on? We have not encountered a problem like this. It seems that the program gets stuck in a DataParallel component (cnl_mlp) which only runs on the secondary GPUs (1~3).
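To find the exact line a stuck run is blocked on, one generic option (my suggestion, not part of the ELICIT codebase) is Python's built-in `faulthandler`, which can dump every thread's stack without killing the process:

```python
import faulthandler
import signal

# Generic hang-debugging sketch (not specific to this repo).

# Dump every thread's Python stack to stderr when the process receives
# SIGUSR1, so a stuck trainer can be inspected from another shell with
# `kill -USR1 <pid>` without terminating it.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump the stacks automatically whenever the process
# keeps running past the timeout (in seconds), repeating each interval.
faulthandler.dump_traceback_later(600, repeat=True)
```

The dumped frames should show whether the workers are blocked inside cnl_mlp or elsewhere. `py-spy dump --pid <pid>` is another option that needs no code changes at all.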
Hi,
I have the same problem. The training is stuck at Epoch 1, Iter 5420, and the GPU status is the same as @JACKYLUO1991's. Have you solved the problem?
Hi @JACKYLUO1991, according to feedback from @hengfei-wang, the custom op grid_sample_3d causes the stuck problem. Please pull this commit and check whether it works for you. Again, we thank @hengfei-wang for his feedback!
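For anyone who wants to rule out the custom op entirely, a possible workaround (my assumption, not something the authors endorse; padding and interpolation behavior may differ from the custom kernel) is PyTorch's built-in `F.grid_sample`, which performs volumetric sampling when given 5-D tensors:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: a small feature volume and a query grid
# with coordinates normalized to [-1, 1], as grid_sample expects.
volume = torch.randn(1, 4, 8, 8, 8)        # (N, C, D, H, W)
grid = torch.rand(1, 6, 6, 6, 3) * 2 - 1   # (N, D', H', W', 3)

# Trilinear sampling of the volume at the grid locations.
sampled = F.grid_sample(volume, grid, align_corners=True)
print(sampled.shape)  # torch.Size([1, 4, 6, 6, 6])
```

If training runs to completion with the built-in op swapped in, that would confirm the custom kernel as the culprit.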
@huangyangyi It seems that this is not the cause; the problem above still occurs.
@hengfei-wang Not really. Did you find a solution? Could it be a GPU architecture problem? The author said he was using V100s.
Hi, @JACKYLUO1991
I can train successfully after revising the code as the author suggested. As for your problem, I don't know the cause.