lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥


GPUs hang partway through multi-GPU training

leo-xuxl opened this issue · comments

Star RTDETR
Please star RTDETR on its homepage to support this project and help more people discover it.


Describe the bug
When training with multiple GPUs on the drone_detection dataset, GPU utilization gets stuck at 100% partway through the first epoch and the run then fails with a timeout error. Single-GPU training on the same dataset works fine, and multi-GPU training on the COCO dataset also works fine.

To Reproduce
Only the config file was modified:
num_classes: 5
remap_mscoco_category: False

train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/train
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/train.json
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  batch_size: 8
  num_workers: 4
  drop_last: True

val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/valid
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/val.json
    transforms:
      type: Compose
      ops: ~

Error message:
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=700635, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809785 milliseconds before timing out

[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1456404 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1456401) of binary: /opt/conda/bin/python
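
A hedged debugging sketch (not part of the original report): before the root cause is found, it can help to make the stalled collective visible and to lengthen the watchdog window so the logs have time to show what each rank is doing. The environment variables below are standard PyTorch/NCCL switches; the `init_process_group` call stands in for wherever the training script initializes distributed training, which is an assumption about the setup rather than RT-DETR's exact code path.

```python
import datetime
import os

import torch.distributed as dist

# Verbose collective logging so the rank that stalls is identifiable,
# instead of only seeing the generic watchdog timeout above.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# Assumed stand-in for wherever the training entry point initializes the
# process group (not RT-DETR's exact code). A timeout longer than the
# default 30 minutes keeps the job alive while the logs are inspected.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```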


I've also been struggling with this problem for a long time. Have you solved it?

> I've also been struggling with this problem for a long time. Have you solved it?

You can refer to this issue: #242
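
Issue #242 is not reproduced here, so the following is only an assumption about a common culprit: when multi-GPU training hangs on a custom COCO-format dataset but runs fine on COCO itself, annotation problems (images with no boxes, or category ids that don't match the declared num_classes) are a frequent cause of ranks diverging. The short check below only uses the annotation file paths from the config in this issue; `check_coco_annotations` is a hypothetical helper, not part of the RT-DETR codebase.

```python
import json
from collections import Counter


def check_coco_annotations(ann_file: str, num_classes: int) -> None:
    """Print basic sanity statistics for a COCO-format annotation file."""
    with open(ann_file) as f:
        coco = json.load(f)

    per_image = Counter(a["image_id"] for a in coco["annotations"])
    empty = [img["id"] for img in coco["images"] if per_image[img["id"]] == 0]
    category_ids = sorted({a["category_id"] for a in coco["annotations"]})

    print(ann_file)
    print(f"  images: {len(coco['images'])}, annotations: {len(coco['annotations'])}")
    print(f"  images without any annotation: {len(empty)}")
    print(f"  category ids used: {category_ids} (num_classes is set to {num_classes})")


# Paths taken from the config posted above.
for split in ("train", "val"):
    check_coco_annotations(
        f"/raid/stu/datasets/drone_detection_coco/annotations/{split}.json",
        num_classes=5,
    )
```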