huangyangyi / ELICIT

[ICCV 2023] One-shot Implicit Animatable Avatars with Model-based Priors

Home Page: https://huangyangyi.github.io/ELICIT/

Training issues

JACKYLUO1991 opened this issue

During training, GPU utilization often sits at 100% and the code gets stuck. Also, the model renders hands very poorly. Is there any suggestion for improving this?

For the hang, please make sure that you have sufficient GPU memory and a compatible environment. We used 4 NVIDIA V100 GPUs (32 GB) for training in all our experiments. If you can tell us where or how the code gets stuck, we may be able to help you analyze the problem. We are also working on making this codebase more efficient.
As for hands, we use SMPL as a shape prior, which doesn't model hand gestures explicitly, so it is still challenging for ELICIT to recover the actual geometry of hands. We will try to address this in our future work.
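
To make the "sufficient GPU memory" check above concrete, here is a minimal sketch that reads per-GPU memory and utilization from nvidia-smi and flags GPUs with little free memory. The helper names (`parse_gpu_csv`, `query_gpus`, `low_memory_gpus`) and the 4096 MiB threshold in the usage line are illustrative assumptions, not part of the ELICIT codebase, and the parser assumes every field is reported as a plain integer.

```python
import subprocess

# Fields requested from nvidia-smi, in the order the parser expects them.
QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total, util = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "util_pct": int(util),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi once and return one dict per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def low_memory_gpus(gpus, min_free_mib):
    """Return the indices of GPUs with less than `min_free_mib` MiB free."""
    return [g["index"] for g in gpus
            if g["total_mib"] - g["used_mib"] < min_free_mib]
```

For example, `low_memory_gpus(query_gpus(), 4096)` run before launching training would list the GPUs that are already nearly full. The threshold is a hypothetical value, not a requirement stated by the authors.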

image
This is the nvidia-smi output when the program is stuck. It happens on every run.


Could you please tell us which line of code the program is stuck on? We have not encountered a problem like this. It seems that the program gets stuck in a DataParallel component (cnl_mlp) that only runs on the secondary GPUs (1~3).
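
One way to find the line a hung Python process is sitting on is the standard-library `faulthandler` module. The sketch below is a generic debugging aid and not part of ELICIT: `watch_for_hang` and `heartbeat` are hypothetical helper names, and the 600-second timeout is an arbitrary assumption. It only shows Python-level stacks, so a hang inside a CUDA kernel would appear as the Python line that launched or is waiting on it.

```python
import faulthandler
import signal
import sys


def watch_for_hang(timeout_s=600, log_file=sys.stderr):
    """Arm a watchdog that dumps all thread stacks after `timeout_s` seconds.

    `log_file` must be a real file object (it needs a file descriptor) and
    must stay open for as long as the watchdog is armed.
    """
    # On POSIX systems, also allow an on-demand dump: `kill -USR1 <pid>`.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    # Dump every `timeout_s` seconds until cancelled or re-armed.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def heartbeat(timeout_s=600, log_file=sys.stderr):
    """Call once per training iteration to restart the watchdog timer.

    Re-arming replaces the previous timer, so a dump is written only when
    a single iteration takes longer than `timeout_s` seconds.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
```

Calling `watch_for_hang()` once at startup and `heartbeat()` at the end of each training iteration would leave a stack trace in the log when an iteration stalls, which is the information requested above.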

Hi,

I have the same problem. Training gets stuck at Epoch 1, Iter 5420, and the GPU status is the same as @JACKYLUO1991's. Have you solved the problem?

image

Hi @JACKYLUO1991, according to feedback from @hengfei-wang, the custom op grid_sample_3d causes the hang. Please pull this commit and check whether it works for you. Again, we thank @hengfei-wang for his feedback!

@huangyangyi It seems that this is not the cause; the problem above still occurs.

@hengfei-wang Not really. Did you find a solution? Could it be a GPU architecture problem? The author said he was using V100s.

Hi, @JACKYLUO1991

I can train successfully after revising the code as the author suggested. As for your problem, I don't know the cause.