lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥


GPUs hang partway through multi-GPU training

leo-xuxl opened this issue · comments

Star RTDETR
Please star RTDETR on its homepage to support this project and help more people discover it.


Describe the bug
When training with multiple GPUs on the drone_detection dataset, GPU utilization gets stuck at 100% partway through the first epoch and the run then fails with a timeout error. Single-GPU training on the same dataset works fine, and multi-GPU training on the COCO dataset also works fine.

To Reproduce
Only the config file was modified:
num_classes: 5
remap_mscoco_category: False

train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/train
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/train.json
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  batch_size: 8
  num_workers: 4
  drop_last: True

val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /raid/stu/datasets/drone_detection_coco/valid
    ann_file: /raid/stu/datasets/drone_detection_coco/annotations/val.json
    transforms:
      type: Compose
      ops: ~

Error message:
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=700635, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809785 milliseconds before timing out

[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1456404 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1456401) of binary: /opt/conda/bin/python
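
A hedged debugging sketch (not part of the original report): before the root cause is found, it can help to make the stalled collective visible and to lengthen the watchdog window so the logs have time to show what each rank is doing. The environment variables below are standard PyTorch/NCCL switches; the `init_process_group` call stands in for wherever the training script initializes distributed training, which is an assumption about the setup rather than RT-DETR's exact code path.

```python
import datetime
import os

import torch.distributed as dist

# Verbose collective logging so the rank that stalls is identifiable,
# instead of only seeing the generic watchdog timeout above.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# Assumed stand-in for wherever the training entry point initializes the
# process group (not RT-DETR's exact code). A timeout longer than the
# default 30 minutes keeps the job alive while the logs are inspected.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```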


I've also been struggling with this problem for a long time. Have you solved it?

> I've also been struggling with this problem for a long time. Have you solved it?

You can refer to this issue: #242
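
Issue #242 is not reproduced here, so the following is only an assumption about a common culprit: when multi-GPU training hangs on a custom COCO-format dataset but runs fine on COCO itself, annotation problems (images with no boxes, or category ids that don't match the declared num_classes) are a frequent cause of ranks diverging. The short check below only uses the annotation file paths from the config in this issue; `check_coco_annotations` is a hypothetical helper, not part of the RT-DETR codebase.

```python
import json
from collections import Counter


def check_coco_annotations(ann_file: str, num_classes: int) -> None:
    """Print basic sanity statistics for a COCO-format annotation file."""
    with open(ann_file) as f:
        coco = json.load(f)

    per_image = Counter(a["image_id"] for a in coco["annotations"])
    empty = [img["id"] for img in coco["images"] if per_image[img["id"]] == 0]
    category_ids = sorted({a["category_id"] for a in coco["annotations"]})

    print(ann_file)
    print(f"  images: {len(coco['images'])}, annotations: {len(coco['annotations'])}")
    print(f"  images without any annotation: {len(empty)}")
    print(f"  category ids used: {category_ids} (num_classes is set to {num_classes})")


# Paths taken from the config posted above.
for split in ("train", "val"):
    check_coco_annotations(
        f"/raid/stu/datasets/drone_detection_coco/annotations/{split}.json",
        num_classes=5,
    )
```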