ZhengyiLuo / PHC

Official Implementation of the ICCV 2023 paper: Perpetual Humanoid Control for Real-time Simulated Avatars

Home Page: https://zhengyiluo.github.io/PHC/

evaluation getting stuck when using --im_eval with num_envs>1

kinalmehta opened this issue

When evaluating on amass with the following command, the code gets stuck at the tqdm progress bar:

python phc/run.py --task HumanoidImMCPGetup --cfg_env phc/data/cfg/phc_shape_mcp_iccv.yaml --cfg_train phc/data/cfg/train/rlg/im_mcp.yaml --motion_file sample_data/amass_isaac_eval.pkl --network_path output/phc_shape_mcp_iccv --test --num_envs 100 --epoch -1 --no_virtual_display --im_eval

However, it works completely fine when running with --num_envs 1.
Library versions in use

python                    3.8
torch                     2.1.1
torchaudio                2.1.1
torchgeometry             0.1.2
torchmetrics              1.2.0
torchvision               0.16.1
tqdm                      4.66.1

Looks like a multiprocessing error; those can be relatively finicky. It could be either at the data loader part or at the robot creation part.

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid.
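
(For context, the idea behind that workaround is to gate the worker pool behind the num_jobs setting so the loader takes a purely single-process path and never spawns workers. A minimal sketch of that pattern, with illustrative function names rather than the actual PHC code:)

    import multiprocessing as mp

    def load_single_motion(path):
        # Placeholder for the per-file parsing that the motion lib performs.
        return path

    def load_motions(motion_files, num_jobs=1):
        # With num_jobs <= 1, stay entirely in the main process: no pool is
        # created, so the multiprocessing deadlock cannot occur.
        if num_jobs <= 1:
            return [load_single_motion(f) for f in motion_files]
        with mp.Pool(processes=num_jobs) as pool:
            return pool.map(load_single_motion, motion_files)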

Yes, I did the debugging and it's exactly as you say: it is getting stuck at the data loading part.

Is there a specific reason for this issue? Any pointers I can refer to in order to solve it?

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid?

Basically, disable multiprocessing. How many cores does your machine have?

It only works after setting num_jobs=1 in both of the places you mentioned. Skipping the change in either one causes the issue.

I tried this on 2 systems:

  1. Ubuntu 20.04 on a 48-core system
  2. Fedora 39 on a 16-core system

Edit:

Another thing I noticed is that humanoid uses Python's native multiprocessing while motion_lib_base uses torch.multiprocessing.

Could this issue be caused by combining these two?

@kinalmehta Not very likely. In my experience, torch.multiprocessing is a wrapper around the standard Python multiprocessing library that adds some customized functions and APIs, and mixing the two typically does not cause an issue.

By replacing multiprocessing with torch.multiprocessing, can you work around this issue?
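
(Since torch.multiprocessing keeps the same API as the standard library module, the suggested swap is essentially just the import line. A minimal, self-contained sketch of the swap, not the actual humanoid code:)

    # Before: import multiprocessing as mp
    # After: the torch wrapper, which is API-compatible but registers custom
    # reducers so tensors are shared between processes instead of copied.
    import torch.multiprocessing as mp

    def square(x):
        return x * x

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:
            print(pool.map(square, range(8)))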

Hi @noahcao
Thanks for the suggestion.
I tried doing that, but the problem still persists.
I'm unable to find a solution to this. The torch.multiprocessing docs mention that the Python implementation is deadlock-free, whereas the torch version can run into deadlocks, and no solution is given.

For the data loading part, try uncommenting this line:

mp.set_sharing_strategy('file_system')

which should fix the issue. Though using file_system has caused me problems before as well...
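
(For reference, the call just needs to run before any worker processes are spawned, e.g. near the top of the module that creates the loader pool; a minimal sketch:)

    import torch.multiprocessing as mp

    # Share tensors through the filesystem instead of file descriptors.
    # This avoids running out of (or deadlocking on) file descriptors when many
    # workers pass loaded motion data back to the main process, at the cost of
    # temporary files that are cleaned up on exit.
    mp.set_sharing_strategy('file_system')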

Does export OMP_NUM_THREADS=1 solve the issue on your end?
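
(In case it helps others: the variable has to be set before the OpenMP runtime is initialized, so either prefix the launch command, e.g. OMP_NUM_THREADS=1 python phc/run.py ..., or set it at the very top of the entry script before torch is imported. A sketch of the latter:)

    import os

    # Restrict OpenMP to one thread per process so the spawned loader workers
    # don't contend over thread pools and deadlock. Must happen before the first
    # import of torch (or numpy), which initializes the OpenMP runtime.
    os.environ["OMP_NUM_THREADS"] = "1"

    import torch  # noqa: E402  (deliberately imported after setting the env var)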

yes!! This solved the issue.

Thanks a lot. :D