Problems running with more than 1 worker
Soumya-dutta opened this issue
Hello,
I am very new to this paradigm of parallel training, so I am probably making a rookie mistake. The issue is that whenever I try to increase the number of workers from 1 to 2, the code hangs at the init_process_group stage.
The system I am running on has 2 GPUs. First I tried modifying train.py itself as described in the readme. Then, following another issue, I tried running with mpirun. Both get stuck at the stage mentioned above.
With mpirun -np 2 python3 train.py I get the following error:
Failed to create a completion queue (CQ):
Hostname: compute-0-0
Requested CQE: 16384
Error: Cannot allocate memory
Check the CQE attribute.
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.
Your job will continue, but Open MPI will ignore the "ud" oob component
in this run.
Hostname: compute-0-0
Distributed init: rank 0/2 - ./output.tmp/dist_init
Distributed init: rank 0/2 - ./output.tmp/dist_init
I checked the memlock limits mentioned on some websites, but they are all set to unlimited.
Also, if I can run without mpirun, I would prefer that.
Since I need this for a course project, could you please guide me on how to run this with more than 1 worker?
Any help is appreciated.
Thanks,
Soumya
UPDATE
I was working with PyTorch version 1.7.1, which I updated to 1.10.0. After that, I set the following:
export NCCL_SOCKET_IFNAME=en,eth
Now the error is as follows:
compute-0-0:188492:188492 [0] bootstrap.cc:40 NCCL WARN Bootstrap : no socket interface found
compute-0-0:188492:188492 [0] NCCL INFO init.cc:98 -> 3
compute-0-0:188492:188492 [0] NCCL INFO init.cc:150 -> 3
compute-0-0:188492:188492 [0] NCCL INFO init.cc:167 -> 3
Traceback (most recent call last):
File "run_new.py", line 44, in <module>
train.main()
File "/home/soumyad/powersgd/train.py", line 177, in main
bits_communicated += reducer.reduce(send_buffers, grads, memories)
File "/home/soumyad/powersgd/gradient_reducers.py", line 753, in reduce
all_reduce(self.p_memory)
File "/home/soumyad/powersgd/gradient_reducers.py", line 1185, in all_reduce
return torch.distributed.all_reduce(*args, **kwargs)
File "/home/soumyad/powersgd/.powersgd/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1285, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 2.10.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption.
Any suggestions as to what I should try?
Hi Soumya,
Thanks for reaching out. It is a bit challenging to give specific guidance here, because the steps depend heavily on your specific cluster setup. I'll try to give you a few high-level tips to get this working.
I would first try with a simple test.py file like the one below:
```python
import torch
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("--rank", type=int)
parser.add_argument("--num-workers", type=int)
args = parser.parse_args()

torch.cuda.set_device(args.rank)  # each process uses its own GPU

torch.distributed.init_process_group(
    "nccl",
    init_method="file:///tmp/some_shared_file_path",  # file:// + absolute path, hence three slashes
    world_size=args.num_workers,
    rank=args.rank,
)

rank = torch.distributed.get_rank()  # number of this process
num_workers = torch.distributed.get_world_size()

if rank == 0:  # the 'master' worker
    print("number of workers", num_workers)

# NCCL only supports CUDA tensors, so put the tensor on this worker's GPU
number = torch.tensor(rank, device="cuda")
print(f"Local number for worker {rank}: {number}")
torch.distributed.all_reduce(number)
print(f"Number after all_reduce for worker {rank}: {number}")
```
You have to start multiple copies of this test.py, each in its own terminal or in the background, for example:
$ python test.py --rank 0 --num-workers 3
$ python test.py --rank 1 --num-workers 3
$ python test.py --rank 2 --num-workers 3
MPI can help you do this in a more practical way: it starts multiple processes and uses environment variables to tell each process what its rank should be. Try reading these environment variables with os.getenv and passing them to init_process_group.
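For Open MPI specifically, the rank and world size show up as OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE (these names are Open MPI's own; other launchers use different variables). A minimal sketch of reading them:

```python
import os

def dist_args_from_mpi_env():
    # Read this process's rank and the world size from the environment
    # variables that Open MPI sets for each process started by mpirun.
    # Fall back to a single-worker setup when they are absent.
    return {
        "rank": int(os.environ.get("OMPI_COMM_WORLD_RANK", "0")),
        "world_size": int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1")),
    }

# The result can be passed straight to init_process_group, so every copy of
# the script can share one command line, e.g.
#   torch.distributed.init_process_group("nccl", init_method=..., **dist_args_from_mpi_env())
if __name__ == "__main__":
    print(dist_args_from_mpi_env())
```

With this in place, `mpirun -np 2 python test.py` no longer needs per-process --rank arguments.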
Have a look at https://pytorch.org/docs/stable/distributed.html, and in particular the section on init_process_group, for the various ways the workers can 'find' each other. You will probably have to change this line in the demo script. Once you have figured this out, you can use your settings in https://github.com/epfml/powersgd/blob/master/train.py#L86-L91.
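As a sanity check that involves neither NCCL nor GPUs, the rendezvous itself can be tried with the gloo backend and the env:// init method; the address and port below are assumed values for a single-machine test:

```python
import os
import torch
import torch.distributed as dist

# env:// rendezvous: every worker reads MASTER_ADDR/MASTER_PORT from the
# environment. With world_size=1 this runs standalone, which makes it a
# quick way to check that init_process_group itself works on a machine.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", init_method="env://", world_size=1, rank=0)

number = torch.tensor(5)
dist.all_reduce(number)  # a no-op with a single worker
print(int(number))
dist.destroy_process_group()
```

If this works but the nccl backend still hangs, the problem is most likely in the network/GPU setup rather than in the rendezvous settings.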
Hope this helps!
Thijs
I'll close this for now. Feel free to add more questions if they come up.