hailanyi / VirConv

Virtual Sparse Convolution for Multimodal 3D Object Detection

Home Page: https://arxiv.org/abs/2303.02314

Failed in multi-GPU training

EvW1998 opened this issue · comments

I can train with a single GPU, but when I try to train with multiple GPUs by running dist_train.sh, the program stops without reporting anything.

My dist_train.sh looks like this:

CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 --master_port 29501 train.py --launcher pytorch > log.txt&
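
(Side note: the launch module used above is deprecated, as the log below warns. On PyTorch 1.10 and later, the presumably equivalent command with torchrun would be:

CUDA_VISIBLE_DEVICES=0,1 nohup torchrun --nproc_per_node=2 --master_port 29501 train.py --launcher pytorch > log.txt &

this assumes train.py reads LOCAL_RANK from the environment rather than expecting a --local_rank argument.)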

log.txt shows the following:

/usr/local/miniconda3/envs/pcdt/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
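
The FutureWarning above is worth ruling out: torch.distributed.run sets --use_env by default, so the per-process rank is delivered through the LOCAL_RANK environment variable instead of the --local_rank argument that torch.distributed.launch passes. A minimal sketch of a train.py that accepts the rank either way, assuming it currently parses --local_rank with argparse (as OpenPCDet-style training scripts typically do):

import os
import argparse

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank; torch.distributed.run sets LOCAL_RANK instead
parser.add_argument('--local_rank', type=int, default=0)
args, _ = parser.parse_known_args()

# prefer the environment variable when the launcher provides it
local_rank = int(os.environ.get('LOCAL_RANK', args.local_rank))

If the script only honors one of the two conventions, the worker processes may initialize with the wrong rank or exit silently, which could match the behavior described here.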


It feels like something is wrong with the distributed setup. Any ideas? Thanks.

commented

Hi, @EvW1998. I'm running into the same problem. Did you manage to solve it?