ZhengyiLuo / PHC

Official Implementation of the ICCV 2023 paper: Perpetual Humanoid Control for Real-time Simulated Avatars

Home Page: https://zhengyiluo.github.io/PHC/

evaluation getting stuck when using --im_eval with num_envs>1

kinalmehta opened this issue

When evaluating on amass with the following command, the code gets stuck at the tqdm progress bar:

python phc/run.py --task HumanoidImMCPGetup --cfg_env phc/data/cfg/phc_shape_mcp_iccv.yaml --cfg_train phc/data/cfg/train/rlg/im_mcp.yaml --motion_file sample_data/amass_isaac_eval.pkl --network_path output/phc_shape_mcp_iccv --test --num_envs 100 --epoch -1 --no_virtual_display --im_eval

However, it works completely fine when running with --num_envs 1.
Library versions in use

python                    3.8
torch                     2.1.1
torchaudio                2.1.1
torchgeometry             0.1.2
torchmetrics              1.2.0
torchvision               0.16.1
tqdm                      4.66.1

Looks like a multiprocessing error; those can be relatively finicky. It could be either at the data loader part or at the robot creation part.

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid.
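
(For context, the idea behind that workaround is to gate the worker pool behind the num_jobs setting so the loader takes a purely single-process path and never spawns workers. A minimal sketch of that pattern, with illustrative function names rather than the actual PHC code:)

    import multiprocessing as mp

    def load_single_motion(path):
        # Placeholder for the per-file parsing that the motion lib performs.
        return path

    def load_motions(motion_files, num_jobs=1):
        # With num_jobs <= 1, stay entirely in the main process: no pool is
        # created, so the multiprocessing deadlock cannot occur.
        if num_jobs <= 1:
            return [load_single_motion(f) for f in motion_files]
        with mp.Pool(processes=num_jobs) as pool:
            return pool.map(load_single_motion, motion_files)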

Yes, I did the debugging and it's exactly as you say: it is getting stuck at the data loading part.

Is there a specific reason for this issue? Any pointers I can refer to in order to solve it?

Try setting num_jobs = 1 at this line in motion_lib_base, or here in humanoid?

Basically, disable multiprocessing. How many cores does your machine have?

It only works after setting num_jobs=1 in both of the places you mentioned. Skipping the change in either one causes the issue.

I tried this on 2 systems:

  1. Ubuntu 20.04 on a 48-core system
  2. Fedora 39 on a 16-core system

Edit:

Another thing I noticed is that humanoid uses Python's native multiprocessing while motion_lib_base uses torch.multiprocessing.

Could this issue be caused by combining these two?

@kinalmehta Not very likely. In my experience, torch.multiprocessing is a wrapper around the standard Python multiprocessing library that adds some customized functions and APIs, and mixing the two typically does not cause an issue.

By replacing multiprocessing with torch.multiprocessing, can you work around this issue?
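
(Since torch.multiprocessing keeps the same API as the standard library module, the suggested swap is essentially just the import line. A minimal, self-contained sketch of the swap, not the actual humanoid code:)

    # Before: import multiprocessing as mp
    # After: the torch wrapper, which is API-compatible but registers custom
    # reducers so tensors are shared between processes instead of copied.
    import torch.multiprocessing as mp

    def square(x):
        return x * x

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:
            print(pool.map(square, range(8)))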

Hi @noahcao
Thanks for the suggestion.
I tried doing that, but the problem still persists.
I'm unable to find a solution to this. The torch.multiprocessing docs mention that the Python implementation is deadlock-free, whereas the torch version can run into deadlocks, and no solution is given.

For the data loading part, try uncommenting this line:

mp.set_sharing_strategy('file_system')

which should fix the issue. Though using file_system has caused me problems before as well...
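
(For reference, the call just needs to run before any worker processes are spawned, e.g. near the top of the module that creates the loader pool; a minimal sketch:)

    import torch.multiprocessing as mp

    # Share tensors through the filesystem instead of file descriptors.
    # This avoids running out of (or deadlocking on) file descriptors when many
    # workers pass loaded motion data back to the main process, at the cost of
    # temporary files that are cleaned up on exit.
    mp.set_sharing_strategy('file_system')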

Does export OMP_NUM_THREADS=1 solve the issue on your end?
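
(In case it helps others: the variable has to be set before the OpenMP runtime is initialized, so either prefix the launch command, e.g. OMP_NUM_THREADS=1 python phc/run.py ..., or set it at the very top of the entry script before torch is imported. A sketch of the latter:)

    import os

    # Restrict OpenMP to one thread per process so the spawned loader workers
    # don't contend over thread pools and deadlock. Must happen before the first
    # import of torch (or numpy), which initializes the OpenMP runtime.
    os.environ["OMP_NUM_THREADS"] = "1"

    import torch  # noqa: E402  (deliberately imported after setting the env var)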

yes!! This solved the issue.

Thanks a lot. :D