Distributed multi-node, multi-GPU training hangs, then fails with a timeout error

Question

z972778371 opened this issue 2 months ago · comments

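After finishing 1 epoch, the program hung during the second epoch of training and then failed with a timeout error.
Where is this problem likely to come from?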
[2024-05-09 01:12:34 accelerate.tracking]: Successfully logged to TensorBoard
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601336 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601338 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 600851 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 600851 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7657580d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7604ac04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7604ac3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7604ac4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f76506dbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f7659e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f7659f26850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601338 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb489380d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fb4386c04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fb4386c3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fb4386c4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7fb4842dbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb48da94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb48db26850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601336 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff8f5980d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff8a2ec04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff8a2ec3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff8a2ec4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7ff8eeadbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ff8f8294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7ff8f8326850 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-05-09 01:28:29,323] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4024 closing signal SIGTERM
[2024-05-09 01:28:31,494] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 4022) of binary: /home/ubuntu/anaconda3/envs/salience_detr/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/salience_detr/bin/accelerate", line 8, in
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 4023)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4023
[2]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 4025)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4025

Root Cause (first observed failure):
[0]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 4022)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4022

Hou Xiuquan · Answer 1 · Sat May 11 2024 00:53:35 GMT+0800 (China Standard Time)

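The causes of a hang in distributed training are fairly complex, but they mostly come down to one thing: during distributed communication (for example, the AllReduce operation here), the data on different processes does not match and cannot be synchronized. Some processes then wait forever for data that never arrives, until they time out and exit.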
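As a rough illustration of this failure mode (not of NCCL itself), the sketch below uses Python threads and a `threading.Barrier` as a stand-in for a collective such as AllReduce. The function name `run_collective` and the rank numbers are hypothetical. The only point is that when one participant never reaches the collective, every other participant waits until the timeout.

```python
import threading


def run_collective(num_ranks, skipping_ranks, timeout_s=0.5):
    """Toy stand-in for a collective: every rank must reach the barrier."""
    barrier = threading.Barrier(num_ranks)
    results = {}

    def worker(rank):
        if rank in skipping_ranks:
            # This rank took a code path that never issues the collective.
            results[rank] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout_s)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised in the rank whose wait timed out and in all other waiters.
            results[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_collective(4, skipping_ranks=set()))  # all ranks participate
    print(run_collective(4, skipping_ranks={2}))    # rank 2 never arrives
```

In the log posted in the question, ranks 0, 1 and 3 report the watchdog timeout while rank 2 does not. That may mean rank 2 was the process that never reached the AllReduce, but this is only a guess from the log.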

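The data that typically has to be synchronized across processes is the ground-truth labels (targets), the losses (loss_dict), and the gradients. Mismatches in this data can arise for the following reasons:

The training dataset contains empty annotations, or some data augmentation (for example, crop) produces empty annotations. A process that encounters an empty annotation may be missing some loss terms, so the loss_dict computed by different processes does not match.
The model chooses a specific branch for the forward pass depending on the input data. For example, process 1 takes branch A while process 2 takes branch B; during backpropagation, the gradients of the different processes then do not match.
The amount of data the DataLoader iterates over during training is not evenly divisible by (number of processes * batch_size), so on the last batch some processes receive no data and the ground-truth labels do not match.
With multiple machines and multiple GPUs, there may be a problem with the communication between nodes.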
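The divisibility case can be checked with simple arithmetic. The sketch below assumes an unpadded round-robin split of sample indices across ranks; `batches_per_rank` and `is_balanced` are hypothetical helpers, not part of this project. PyTorch's `DistributedSampler` pads the index list by default so that every rank gets the same number of samples, so whether this mismatch can actually occur depends on how the DataLoader is sharded in practice.

```python
import math


def batches_per_rank(num_samples, world_size, batch_size):
    """Batches each rank sees if samples are dealt round-robin with no padding."""
    counts = []
    for rank in range(world_size):
        # Rank r takes samples r, r + world_size, r + 2 * world_size, ...
        samples = len(range(rank, num_samples, world_size))
        counts.append(math.ceil(samples / batch_size))
    return counts


def is_balanced(num_samples, world_size, batch_size):
    """True when every rank runs the same number of iterations per epoch."""
    return len(set(batches_per_rank(num_samples, world_size, batch_size))) == 1


if __name__ == "__main__":
    print(batches_per_rank(17, 4, 2))  # rank 0 gets one extra batch
    print(batches_per_rank(16, 4, 2))  # evenly divisible
```

If the counts differ, the rank with the extra batch issues one more collective than the others and waits for peers that have already finished the epoch.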

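When I tested the code on a single machine with multiple GPUs, I did not see any hang, and I currently have no good way to pinpoint the cause. Could you provide the dataset you ran, the training configuration, the log file, the shell output, and which of the following statements the code was executing when it hung:

loss_dict = model(images, targets): a hang here is probably a dataset problem
loss.backward() or optimizer.step(): a hang here is probably a problem in our model code
accelerator.reduce(loss_dict, reduction="mean"): a hang here could be either a dataset or a code problem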
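One possible way to find out which of these statements a rank is stuck in is Python's standard `faulthandler` module, which can dump the Python traceback of all threads if a block of code has not finished within a given time. The sketch below is only a suggestion: `hang_watch` is a hypothetical helper, the timeout is arbitrary, and the names in the commented usage are the ones from the statements above.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watch(label, timeout_s=300.0, file=None):
    """Dump all Python tracebacks if the wrapped block runs longer than timeout_s."""
    out = sys.stderr if file is None else file
    print(f"[hang_watch] entering {label}", file=out, flush=True)
    # The dump is written by a watchdog thread, so it still fires while the
    # main thread is blocked inside a long-running call.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
    try:
        yield
    finally:
        # The block finished in time (or raised): disarm the pending dump.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside the training loop:
#
#   with hang_watch("forward"):
#       loss_dict = model(images, targets)
#   with hang_watch("backward"):
#       loss.backward()
#   with hang_watch("reduce"):
#       loss_dict_reduced = accelerator.reduce(loss_dict, reduction="mean")
```

The timeout should be shorter than the 600000 ms watchdog timeout shown in the log, so that the dump appears before the process is taken down. The last "entering" label printed by each rank, together with the dumped traceback, shows where that rank stopped.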

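I suggest trying the following:

Check whether the data contains empty annotations.
Use single-machine multi-GPU training, or no distributed training at all, and see whether the problem still occurs.
After each forward and backward pass, check whether any parameter has a gradient of None; if so, please record it and report it to me.
Try a different PyTorch version.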
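The first and third checks could look roughly like the sketch below. Both helpers are hypothetical. The `"boxes"` key is an assumption about the target format (common in detection code, but not confirmed in this thread), and the parameter check is written against plain attributes so that it accepts the `(name, parameter)` pairs from `model.named_parameters()`.

```python
from types import SimpleNamespace


def find_empty_targets(targets, key="boxes"):
    """Return indices of samples whose annotation list under `key` is empty."""
    return [i for i, t in enumerate(targets) if len(t[key]) == 0]


def find_params_without_grad(named_params):
    """Return names of trainable parameters whose .grad is still None."""
    return [
        name
        for name, p in named_params
        if getattr(p, "requires_grad", False) and getattr(p, "grad", None) is None
    ]


if __name__ == "__main__":
    # Stand-in objects so the sketch runs without a model or dataset.
    demo_targets = [{"boxes": [[0, 0, 10, 10]]}, {"boxes": []}]
    demo_params = [
        ("backbone.weight", SimpleNamespace(requires_grad=True, grad=0.1)),
        ("head.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ]
    print(find_empty_targets(demo_targets))
    print(find_params_without_grad(demo_params))

# Hypothetical usage in the training loop, after loss.backward():
#   missing = find_params_without_grad(model.named_parameters())
```

Gradients are only populated by the backward pass, so the parameter check is meaningful after `loss.backward()` and before the gradients are cleared for the next step.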

Also see similar issues in other projects:
IDEA-CCNL/Fengshenbang-LM#123
bubbliiiing/faster-rcnn-pytorch#9
tinyvision/DAMO-YOLO#30
bubbliiiing/deeplabv3-plus-pytorch#92
https://zhuanlan.zhihu.com/p/60054075

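That said, this problem is hard to solve. I will keep debugging the code to see whether I can reproduce it. If it is convenient for you, you can send your contact details to my email (xiuqhou@stu.xjtu.edu.cn) so we can discuss further.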

fppccc · Answer 2 · Fri Jul 19 2024 16:11:26 GMT+0800 (China Standard Time)

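I ran into the same problem. I am training on 2 GPUs in parallel and get the same error at the start of the second epoch, so only one epoch completes.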

fppccc · Answer 3 · Fri Jul 19 2024 16:13:12 GMT+0800 (China Standard Time)

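Did the original poster ever follow up with you, and was the problem resolved? I am reproducing the COCO experiments with 2 GPUs in parallel and get the same error.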

Hou Xiuquan · Answer 4 · Fri Jul 19 2024 16:24:12 GMT+0800 (China Standard Time)

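The original poster never got back to me, and I have not been able to reproduce the problem either 😢 Could you provide more detailed information, such as the error output, your train_config.py file, and your PyTorch version?

The PyTorch versions I use are 1.12.0 and 2.1.1, and I have not hit this problem with either so far. You could try running with one of these versions.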

fppccc · Answer 5 · Fri Jul 19 2024 16:56:47 GMT+0800 (China Standard Time)

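My PyTorch version is 1.11.0. In train_config.py I only changed the location of the COCO dataset. The error output is the same as the original poster's, but because I was running in a tmux window I could not copy all of it. Thanks for the answer! I will try pytorch==1.12.0 and should know by tomorrow whether it works~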

fppccc · Answer 6 · Sat Jul 20 2024 10:51:22 GMT+0800 (China Standard Time)

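Hello! I tried dual-GPU parallel training with PyTorch 2.1.1 and got the same error, so I suspect the PyTorch version is not the cause. If needed, I can run it again with PyTorch 1.12.0. Because I am using a tmux window, the only error output I could capture is the following few lines (exactly the same as the original poster's):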

in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-07-19_23:53:45
host : dell-DSS8440
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 14854)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Hou Xiuquan · Answer 7 · Sat Jul 20 2024 11:20:23 GMT+0800 (China Standard Time)

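A tmux window can be scrolled up: press ctrl+b, then the [ key, and you should be able to see the fuller error output further up.

Does the error occur in the second epoch every time, and always at the same training step?

Would it be convenient for you to add me as a contact so we can discuss further?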

fppccc · Answer 8 · Sat Jul 20 2024 12:27:24 GMT+0800 (China Standard Time)

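Thanks for the tip! I now have it running with PyTorch 1.12.0. If it fails again, I will copy the error output using the method you described.

According to training.log, the error does occur right at the start of the second epoch every time. (It dies before the epoch even gets going.)

My WeChat ID is LCWHU-0823. You are very welcome to contact me to discuss further!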

Hou Xiuquan · Answer 9 · Sat Jul 20 2024 13:34:34 GMT+0800 (China Standard Time)

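For some reason I cannot find this WeChat ID 😧 Perhaps you have disabled being found by search. Could you send your WeChat QR code to my email so I can add you?
My email: xiuqhou@stu.xjtu.edu.cn

If it dies every time before the second epoch has even started, it is probably not a problem of the dataset or the model gradients being out of sync; otherwise the processes would hang at a random iteration. I searched for related material and found a few similar answers in which training also hangs right at the start of an epoch. You may find them useful: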

bubbliiiing/faster-rcnn-pytorch#9 (comment)
bubbliiiing/deeplabv3-plus-pytorch#92 (comment)
