epfml / powersgd

Practical low-rank gradient compression for distributed optimization: https://arxiv.org/abs/1905.13727

Problems running with more than 1 worker

Soumya-dutta opened this issue · comments

Hello,

I am very new to this paradigm of parallel training, so I am probably making some rookie mistake. The issue is that whenever I try to increase the number of workers from 1 to 2, the code hangs at the init_process_group stage.

The system that I am running on has 2 GPUs. First I tried modifying train.py itself, as mentioned in the README. Then, following another issue, I tried running with mpirun. Both attempts get stuck at the stage mentioned above.

With mpirun -np 2 python3 train.py I get the following error:

Failed to create a completion queue (CQ):

Hostname: compute-0-0
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.

Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

Your job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: compute-0-0

Distributed init: rank 0/2 - ./output.tmp/dist_init
Distributed init: rank 0/2 - ./output.tmp/dist_init

I checked the memlock limits mentioned on some websites, but they are all set to UNLIMITED.

Also, if I can run without mpirun, I would prefer that.

Since I need this for one of my course projects, could you please guide me on how to run this with more than one worker?

Any help is appreciated.

Thanks,
Soumya

UPDATE

I was working with PyTorch version 1.7.1, which I updated to 1.10.0. After that, I set the following:

export NCCL_SOCKET_IFNAME=en,eth

Now I get the following error:

compute-0-0:188492:188492 [0] bootstrap.cc:40 NCCL WARN Bootstrap : no socket interface found
compute-0-0:188492:188492 [0] NCCL INFO init.cc:98 -> 3
compute-0-0:188492:188492 [0] NCCL INFO init.cc:150 -> 3
compute-0-0:188492:188492 [0] NCCL INFO init.cc:167 -> 3
Traceback (most recent call last):
  File "run_new.py", line 44, in <module>
    train.main()
  File "/home/soumyad/powersgd/train.py", line 177, in main
    bits_communicated += reducer.reduce(send_buffers, grads, memories)
  File "/home/soumyad/powersgd/gradient_reducers.py", line 753, in reduce
    all_reduce(self.p_memory)
  File "/home/soumyad/powersgd/gradient_reducers.py", line 1185, in all_reduce
    return torch.distributed.all_reduce(*args, **kwargs)
  File "/home/soumyad/powersgd/.powersgd/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1285, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption.

Any suggestions as to what I should try?

Hi Soumya,

Thanks for reaching out. It is a bit challenging to give specific guidance here, because the steps depend heavily on your specific cluster setup. I'll try to give you a few high-level tips to get this to work.

I would first try with a simple test.py file like the one below:

import torch
import torch.distributed
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("--rank", type=int)
parser.add_argument("--num-workers", type=int)
args = parser.parse_args()

# With the NCCL backend, each process should use its own GPU.
# On a single machine, the rank can double as the GPU index.
torch.cuda.set_device(args.rank)

torch.distributed.init_process_group(
    "nccl",
    init_method="file:///tmp/some_shared_file_path",  # note the three slashes: a path all workers can access
    world_size=args.num_workers,
    rank=args.rank,
)

rank = torch.distributed.get_rank()  # number of this process
num_workers = torch.distributed.get_world_size()

if rank == 0:  # ('master' node)
    print("number of workers", num_workers)

# NCCL only reduces CUDA tensors; use the "gloo" backend if you want to try this on the CPU.
number = torch.tensor(rank, device="cuda")

print(f"Local number for worker {rank}: {number}")
torch.distributed.all_reduce(number)
print(f"Number after all_reduce for worker {rank}: {number}")

You have to start multiple copies of this test.py yourself (e.g. in separate terminals), for example with two workers for your two GPUs:

$ python test.py --rank 0 --num-workers 2
$ python test.py --rank 1 --num-workers 2

MPI can help you do this in a more practical way. It starts multiple processes, and it uses environment variables to tell each process what its rank should be. Try reading these environment variables with os.getenv and passing them to init_process_group.
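For example, with Open MPI the rank and world size are exposed as the environment variables OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_LOCAL_RANK (other MPI implementations use different names). A rough sketch of how you could wire them into the test script, and then launch it with mpirun -np 2 python3 test.py:

import os
import torch
import torch.distributed

# Rank/size of this process, as set by mpirun (Open MPI variable names).
rank = int(os.getenv("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.getenv("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.getenv("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

torch.cuda.set_device(local_rank)  # one GPU per process on this machine

torch.distributed.init_process_group(
    "nccl",
    init_method="file:///tmp/some_shared_file_path",  # any path all workers can see
    world_size=world_size,
    rank=rank,
)

print(f"worker {rank} of {world_size} is up")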

Have a look at https://pytorch.org/docs/stable/distributed.html, in particular the section on init_process_group, for the various ways the workers can 'find' each other. You will probably have to change the init_method line in the demo script. Once you have figured this out, you can use your settings in https://github.com/epfml/powersgd/blob/master/train.py#L86-L91.
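Since you mentioned you would rather not use mpirun: PyTorch 1.10 (your upgraded version) also ships its own launcher, torchrun (python -m torch.distributed.launch on older versions), which starts one process per GPU and sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT for you, so the script can use the env:// init method. A minimal sketch (the file name test_env.py is arbitrary), launched with torchrun --nproc_per_node=2 test_env.py:

import os
import torch
import torch.distributed

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are also set by torchrun,
# so env:// needs no further arguments.
torch.distributed.init_process_group("nccl", init_method="env://")

x = torch.tensor(torch.distributed.get_rank(), device="cuda")
torch.distributed.all_reduce(x)  # default op is SUM
print(f"rank {torch.distributed.get_rank()}: sum of ranks = {x.item()}")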

Hope this helps!
Thijs

I'll close this for now. Feel free to add more questions if they come up.