pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

distributed all_reduce deadlocks in v1.1

prafullasd opened this issue · comments

πŸ› Bug

I'm doing multi-node training (8 nodes, 8 GPUs each, NCCL backend) and am using DistributedDataParallel for syncing grads and distributed.all_reduce() calls to log losses. I recently upgraded from PyTorch v1.0 to v1.1, and after doing so my training script hangs at a distributed.all_reduce() call. The hang doesn't occur if I downgrade PyTorch to v1.0. It also hangs if I use the pytorch-nightly version.
Some observations about the deadlock that might be useful:

  1. The script always deadlocks after exactly the same number of training iterations (7699). Changing the model architecture changed this number, but it is still the same across different runs of the same architecture.

  2. In all the runs, the hang occurs at an all_reduce of a single-element GPU tensor created as follows:

loss = loss.item()
sum = torch.tensor(loss * batch).float().cuda()  # batch can differ on each rank
distributed.all_reduce(sum)
sum = sum.item()

All ranks complete line 3, but only ranks {0, 8, 16, ..., 56}, i.e. ranks with local_rank=0, complete line 4. I checked whether different processes were accessing the same GPU, but each process is using its corresponding GPU (set using torch.cuda.set_device(local_rank), though all GPUs are visible to each process). I tried adding a dist.barrier() after the all_reduce(), and also reusing the same tensor for sum (instead of creating a new one), but it still hangs, always after the same number of iterations (7699). (A minimal per-rank sketch of this setup is included after this list.)
  3. The reduction within DDP itself hasn't hung in any of my runs yet. I tried setting find_unused_parameters=True, but that didn't help.
  4. I looked at the NCCL debug logs using NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG=INFO, and no errors were thrown. The last calls correspond to all_reduce calls for a 1-element tensor, though there are fewer of them than in the previous iteration.
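
For reference, a minimal per-rank sketch of the setup described above, assuming a toy model, a synthetic batch, and LOCAL_RANK provided by the launcher (none of this is the actual training script):

import os

import torch
import torch.distributed as distributed
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, NCCL backend. local_rank comes from the launcher
# (read here from the LOCAL_RANK env var; older launchers pass --local_rank).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
distributed.init_process_group(backend="nccl")

# Toy model standing in for the real architecture.
model = DDP(torch.nn.Linear(32, 1).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10000):
    batch = 16  # in the real job the batch size can differ across ranks
    x = torch.randn(batch, 32).cuda()
    loss = model(x).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()  # DDP all-reduces gradients here
    optimizer.step()

    # Loss logging via all_reduce, matching the snippet in observation 2.
    loss_sum = torch.tensor(loss.item() * batch).float().cuda()
    distributed.all_reduce(loss_sum)  # the call that eventually hangs
    loss_sum = loss_sum.item()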

To Reproduce

It takes about 7 hrs for the deadlock to happen, and it's hard for me to share the code. I'll try to see if I can come up with a simpler script that reproduces the deadlock.

Environment

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.11.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130

GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 410.79
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.4

Versions of relevant libraries:
[pip3] numpy==1.16.3
[pip3] tcop-pytorch==0.0.0
[pip3] torch==1.1.0
[pip3] torchvision==0.2.2.post3
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.3                      199
[conda] mkl_fft                   1.0.12           py36ha843d7b_0
[conda] mkl_random                1.0.2            py36hd81dba3_0
[conda] pytorch                   1.1.0           py3.6_cuda10.0.130_cudnn7.5.1_0    pytorch
[conda] tcop-pytorch              0.0.0                    pypi_0    pypi
[conda] torchvision               0.2.2.post3              pypi_0    pypi

I have the same problem

I have the same problem, but I'm not sure if it happened only with all_reduce.
In addition, I use DistributedDataParallel from NVIDIA's apex.

Actually, I'm not sure the problem is in all_reduce either. I removed everything from my code, including all the all-reduce ops, except the training itself (the backward pass, etc.), but I still get this weird deadlock after ~35 hours of training.

In my case I don't get a deadlock if I run the same code on a single node with 8 GPUs. I also don't get a deadlock if I remove the all_reduce() call (while keeping DistributedDataParallel for grads). It's probably related to the multi-node all_reduce() call, then.

PyTorch 1.1 uses NCCL 2.4.2, which has a known issue of hanging on long-running jobs; it was fixed in 2.4.6. NVIDIA/nccl@f40ce73
The workaround is to export NCCL_LL_THRESHOLD=0.
cc @pietern, @mrshenli to bump the nccl submodule.
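
One way to apply that workaround from inside the script, assuming it runs before init_process_group (and thus before the first NCCL communicator is created); exporting the variable in the launch environment works just as well:

import os

# Workaround from the comment above: NCCL_LL_THRESHOLD=0 avoids the
# low-latency-protocol hang in NCCL 2.4.2. It must be set before NCCL
# is initialized.
os.environ["NCCL_LL_THRESHOLD"] = "0"

import torch.distributed as distributed

distributed.init_process_group(backend="nccl")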

@ngimel Thanks a lot for the workaround; I can confirm that jobs have been running for longer than 50 hrs without a deadlock.
@mrshenli Thanks for updating the submodule so quickly :) I'll run my jobs again with the latest pytorch-nightly to make sure there are no deadlocks.

I am having the same bug with NCCL version 2.5.6 and torch 1.4.

Env:

  • Ubuntu 18.04
  • Pytorch 1.6.0
  • CUDA 10.1

Actually, I am using the Docker image gemfield/pytorch:1.6.0-devel stated in https://github.com/DeepVAC/deepvac (same as the env above), and I use PyTorch DDP (via the class DeepvacDDP in https://github.com/DeepVAC/deepvac/blob/master/deepvac/syszux_deepvac.py) to train my model; the code worked perfectly yesterday. But today, when I launched the training program again, DDP got stuck in loss.backward(), with CPU at 100% and GPU at 100%.
There has been no code change or Docker container change since yesterday, except that the Ubuntu host got a system update today:

gemfield@ai03:~$ cat /var/log/apt/history.log | grep -C 3 nvidia

Start-Date: 2020-09-03  06:44:01
Commandline: /usr/bin/unattended-upgrade
Install: linux-modules-nvidia-440-5.4.0-45-generic:amd64 (5.4.0-45.49, automatic)
Upgrade: linux-modules-nvidia-440-generic-hwe-20.04:amd64 (5.4.0-42.46, 5.4.0-45.49)
End-Date: 2020-09-03  06:44:33

Apparently, the NVIDIA driver got updated from 440.64 to 440.100; I think this info may be useful for somebody.

I encountered the same error: the program gets stuck at 100% GPU usage with no warnings or errors. I updated NCCL to 2.7.08 with PyTorch 1.7, and my NVIDIA driver is 440.100, but the problem is still there. The program gets stuck when I use torch.distributed.reduce.

Finally, I figured out why this happens. I was calling torch.distributed.reduce inside a conditional code block that only the rank-0 process executes. This leads to an NCCL deadlock. So if you call a distributed collective, make sure all processes get the chance to run it. I just moved the call outside the conditional block, and the problem disappeared.
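
A minimal sketch of that mistake and the fix (hypothetical metric tensor): the collective has to be entered by every rank, even if only rank 0 needs the result.

import torch
import torch.distributed as distributed

# Assumes the process group is already initialized with the NCCL backend
# and torch.cuda.set_device() has been called per rank.
rank = distributed.get_rank()
metric = torch.tensor([1.0], device="cuda")

# Deadlocks: only rank 0 enters the collective, so it waits forever for
# peers that never call reduce.
# if rank == 0:
#     distributed.reduce(metric, dst=0)

# Works: every rank calls reduce; the summed result lands on rank 0 only.
distributed.reduce(metric, dst=0)
if rank == 0:
    print("total metric:", metric.item())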

@Do

I encountered the same error: the program gets stuck at 100% GPU usage with no warnings or errors. I updated NCCL to 2.7.08 with PyTorch 1.7, and my NVIDIA driver is 440.100, but the problem is still there. The program gets stuck when I use torch.distributed.reduce.

Finally, I figured out why this happens. I was calling torch.distributed.reduce inside a conditional code block that only the rank-0 process executes. This leads to an NCCL deadlock. So if you call a distributed collective, make sure all processes get the chance to run it. I just moved the call outside the conditional block, and the problem disappeared.

That kind of deadlock happens as soon as the code starts running, though, which is not the case in this issue.

I seem to be having the same issue. The code works on 1 node but fails on 2. The training loop works fine, but then I do distributed validation with an all_reduce at the end. One process hangs on the all_reduce and breaks the code (I have only tried this on 2 nodes, but the one that breaks is always GPU 0 on node 0). Without the all_reduce I have no problem. Checking the debug logs, I get this warning after the all_reduce: "NCCL WARN Net : Connection closed by remote peer". This is the only warning or error I get. I checked my NCCL version with torch.cuda.nccl.version() and got "2408", which I'm guessing is 2.4.8, so the fix referenced above by @ngimel should already be included. Interestingly enough, the workaround does work when there are 2 nodes with only one GPU each, but not when there is more than one GPU per node.

Also, I am not calling this inside a conditional block as @dongdongbh was.

Anyone have any ideas?

EDIT: I am working around this now by using the reduce function (which works fine) as I don't need the data on each process anyway.

For me it was solved after removing tqdm.