lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥

torchrun hangs when training on my own dataset

myalos opened this issue · comments

commented

Thanks for sharing the code and paper. I'm training on my own dataset with torchrun --nproc_per_node=2. Training on COCO works fine, but with my own dataset training hangs. I added logging to the training loop; the output is below, and after the last line there is no further output, it just hangs, and I don't know why. One other strange thing: I added print statements inside reduce_dict in misc/dist.py, and not a single one was ever printed. I've been debugging for a long time without finding the cause, so I'd like to ask for help. Many thanks.

2024-03-25 17:23:55.925 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:55.962 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:55.971 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:56.301 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:56.306 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:56.367 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:56.371 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:56.400 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:56.400 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:56.403 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:56.404 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:56.401 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(23.1628, device='cuda:1')
2024-03-25 17:23:56.405 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:56.405 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:56.404 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(23.1628, device='cuda:0')
2024-03-25 17:23:56.406 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:56.407 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
Epoch: [0]  [  0/680]  eta: 0:42:45  lr: 0.000010  loss: 23.1628 (23.1628)  loss_bbox: 0.9969 (0.9969)  loss_bbox_aux_0: 1.0977 (1.0977)  loss_bbox_aux_1: 0.9761 (0.9761)  loss_bbox_aux_2: 1.0618 (1.0618)  loss_bbox_dn_0: 0.5916 (0.5916)  loss_bbox_dn_1: 0.3490 (0.3490)  loss_bbox_dn_2: 0.3517 (0.3517)  loss_giou: 1.6710 (1.6710)  loss_giou_aux_0: 1.8911 (1.8911)  loss_giou_aux_1: 1.7543 (1.7543)  loss_giou_aux_2: 1.9855 (1.9855)  loss_giou_dn_0: 0.8708 (0.8708)  loss_giou_dn_1: 0.6687 (0.6687)  loss_giou_dn_2: 0.5373 (0.5373)  loss_vfl: 0.5661 (0.5661)  loss_vfl_aux_0: 0.2631 (0.2631)  loss_vfl_aux_1: 0.3563 (0.3563)  loss_vfl_aux_2: 0.1520 (0.1520)  loss_vfl_dn_0: 1.8495 (1.8495)  loss_vfl_dn_1: 2.4024 (2.4024)  loss_vfl_dn_2: 2.7697 (2.7697)  time: 3.7722  data: 1.1741  max mem: 4504
2024-03-25 17:23:56.440 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:56.443 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:56.606 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:56.607 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:56.642 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:56.642 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:56.821 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:56.850 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:56.869 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:56.871 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:56.892 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:56.893 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:56.893 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:56.893 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:56.893 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(23.1027, device='cuda:1')
2024-03-25 17:23:56.894 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(23.1027, device='cuda:0')
2024-03-25 17:23:56.894 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:56.894 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:56.895 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:56.895 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:56.914 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:56.920 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:57.101 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:57.102 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:57.139 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:57.139 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:57.341 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:57.356 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:57.373 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:57.374 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:57.396 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:57.396 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:57.397 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:57.398 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:57.396 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(22.4627, device='cuda:0')
2024-03-25 17:23:57.398 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:57.398 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:57.398 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(22.4627, device='cuda:1')
2024-03-25 17:23:57.399 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:57.399 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:57.418 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:57.418 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:57.595 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:57.596 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:57.633 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:57.635 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:57.849 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:57.885 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:57.895 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:57.902 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:57.918 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:57.919 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:57.926 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:57.927 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:57.919 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(24.1764, device='cuda:0')
2024-03-25 17:23:57.927 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:57.928 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:57.927 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(24.1764, device='cuda:1')
2024-03-25 17:23:57.928 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:57.928 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:57.949 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:57.950 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:58.132 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:58.132 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:58.168 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:58.169 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:58.298 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:58.382 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:58.403 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:58.404 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:58.428 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:58.429 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:58.429 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:58.430 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:58.429 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(23.9161, device='cuda:0')
2024-03-25 17:23:58.430 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:58.431 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:58.430 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(23.9161, device='cuda:1')
2024-03-25 17:23:58.431 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:58.431 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:58.451 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:58.451 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
2024-03-25 17:23:58.626 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:58.629 | INFO     | src.solver.det_engine:train_one_epoch:60 - model done
2024-03-25 17:23:58.645 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:58.665 | INFO     | src.solver.det_engine:train_one_epoch:62 - loss compute done
2024-03-25 17:23:58.858 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:58.859 | INFO     | src.solver.det_engine:train_one_epoch:67 - loss backward done
2024-03-25 17:23:58.874 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:58.877 | INFO     | src.solver.det_engine:train_one_epoch:73 - optimizer done
2024-03-25 17:23:58.899 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:58.900 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:58.901 | INFO     | src.solver.det_engine:train_one_epoch:78 - ema done
2024-03-25 17:23:58.901 | INFO     | src.solver.det_engine:train_one_epoch:82 - loss reduce done
2024-03-25 17:23:58.900 | INFO     | src.solver.det_engine:train_one_epoch:83 - tensor(5.5421, device='cuda:1')
2024-03-25 17:23:58.902 | INFO     | src.solver.det_engine:train_one_epoch:90 - metric update start
2024-03-25 17:23:58.902 | INFO     | src.solver.det_engine:train_one_epoch:93 - metric update done
2024-03-25 17:23:58.928 | INFO     | src.solver.det_engine:train_one_epoch:38 - data loaded
  1. Set the dataloader's num_workers to 1 or 0
  2. Try running on a single GPU

See whether any error message shows up

commented

Thanks for the reply. I did some more debugging and found that the reduce_dict actually being used is the one from logger.py, not the one in dist.py. The reason torchrun hangs is that on one GPU the computed output keys include the dn losses while on the other GPU they don't, so the two GPUs call all_reduce on values with different shapes.
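A key mismatch like this makes the collective op wait forever, because all_reduce must be called the same number of times, on the same shapes, on every rank. One possible workaround, sketched here in plain Python (the helper name `align_loss_keys` is mine, not from the repo, and in the real code the zero filler would be a tensor on the right device): pad every rank's loss dict to one fixed key set before reducing, so all ranks reduce the same entries in the same order.

```python
def align_loss_keys(loss_dict, all_keys, zero=0.0):
    """Return a dict with exactly `all_keys`, in sorted order, filling
    missing entries with `zero`, so every rank reduces the same set."""
    return {k: loss_dict.get(k, zero) for k in sorted(all_keys)}

# Rank 0's batch produced dn losses, rank 1's did not (no GT boxes).
rank0 = {'loss_vfl': 0.57, 'loss_vfl_dn_0': 1.85}
rank1 = {'loss_vfl': 0.61}

# The full key set must be fixed up front (e.g. derived from the
# criterion's config), not gathered at runtime from a single rank.
ALL_KEYS = ['loss_vfl', 'loss_vfl_dn_0']

a0 = align_loss_keys(rank0, ALL_KEYS)
a1 = align_loss_keys(rank1, ALL_KEYS)
assert list(a0) == list(a1)  # same keys, same order: all_reduce calls now match
```

Note the trade-off: the padded zeros slightly dilute the averaged dn losses on ranks that had no denoising targets, but the ranks stay in lockstep.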

commented

I took a couple of days off; this case is reproducible on my side, and I'm continuing to look for the cause today.

commented

2024-03-28 17:32:19.679 | INFO | src.zoo.rtdetr.denoising:get_contrastive_denoising_training_group:27 - rank : 1, num_gts : [0, 0, 0, 0, 0, 0, 0, 0], max_gt_num : 0, which causes the function to return four Nones, and dn_meta comes back as None

Try changing it to this? @myalos

    max_gt_num = max(num_gts)

    # guard: a batch where every image has zero GT boxes gives max_gt_num == 0
    if max_gt_num == 0:
        num_group = 1
    else:
        num_group = num_denoising // max_gt_num

    num_group = 1 if num_group == 0 else num_group

https://github.com/lyuwenyu/RT-DETR/blob/main/rtdetr_pytorch/src/zoo/rtdetr/denoising.py#L25-L30
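The point of the guard: without it, `num_denoising // max_gt_num` divides by zero exactly in the `max_gt_num : 0` case shown in the log above. A minimal stand-alone sketch of the guarded computation (plain Python; the names follow the snippet, and the default `num_denoising=100` here is just an illustrative value, matching the repo's default config):

```python
def compute_num_group(num_gts, num_denoising=100):
    """Number of denoising groups, guarded so an all-empty batch
    (max_gt_num == 0) no longer raises ZeroDivisionError."""
    max_gt_num = max(num_gts)
    if max_gt_num == 0:
        num_group = 1
    else:
        num_group = num_denoising // max_gt_num
    return 1 if num_group == 0 else num_group

print(compute_num_group([0, 0, 0, 0]))   # all-empty batch -> 1, no crash
print(compute_num_group([5, 2, 0, 7]))   # 100 // 7 -> 14
```

The trailing `1 if num_group == 0` line also covers the opposite edge case, where a single image has more GT boxes than num_denoising and the integer division rounds down to zero groups.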

commented

The loss became nan:

    tensor(nan, device='cuda:1')
    2024-03-29 11:27:42.737 | INFO | src.solver.det_engine:train_one_epoch:82 - loss reduce done
    Loss is nan, stopping training
    2024-03-29 11:27:42.737 | INFO | src.solver.det_engine:train_one_epoch:83 - tensor(nan, device='cuda:0')
    {'loss_bbox': tensor(0.5481, device='cuda:0'), 'loss_bbox_aux_0': tensor(0.6342, device='cuda:0'), 'loss_bbox_aux_1': tensor(0.5393, device='cuda:0'), 'loss_bbox_aux_2': tensor(0.8465, device='cuda:0'), 'loss_bbox_dn_0': tensor(0.4950, device='cuda:0'), 'loss_bbox_dn_1': tensor(0.1605, device='cuda:0'), 'loss_bbox_dn_2': tensor(0.1340, device='cuda:0'), 'loss_giou': tensor(0.7445, device='cuda:0'), 'loss_giou_aux_0': tensor(0.9554, device='cuda:0'), 'loss_giou_aux_1': tensor(0.7342, device='cuda:0'), 'loss_giou_aux_2': tensor(1.5535, device='cuda:0'), 'loss_giou_dn_0': tensor(0.6790, device='cuda:0'), 'loss_giou_dn_1': tensor(0.3321, device='cuda:0'), 'loss_giou_dn_2': tensor(0.2559, device='cuda:0'), 'loss_vfl': tensor(2.3894, device='cuda:0'), 'loss_vfl_aux_0': tensor(1.7126, device='cuda:0'), 'loss_vfl_aux_1': tensor(2.6082, device='cuda:0'), 'loss_vfl_aux_2': tensor(0.8458, device='cuda:0'), 'loss_vfl_dn_0': tensor(nan, device='cuda:0'), 'loss_vfl_dn_1': tensor(nan, device='cuda:0'), 'loss_vfl_dn_2': tensor(nan, device='cuda:0')}
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1279481) of binary:
commented

Got it, I'll look into the cause. I'll be out over the weekend, so I'll reply next Monday or Tuesday.

> Thanks for the reply. I did some more debugging and found that the reduce_dict actually being used is the one from logger.py, not the one in dist.py. The reason torchrun hangs is that on one GPU the computed output keys include the dn losses while on the other GPU they don't, so the two GPUs call all_reduce on values with different shapes.

Same problem here; I just removed reduce_dict entirely.

commented

> Thanks for the reply. I did some more debugging and found that the reduce_dict actually being used is the one from logger.py, not the one in dist.py. The reason torchrun hangs is that on one GPU the computed output keys include the dn losses while on the other GPU they don't, so the two GPUs call all_reduce on values with different shapes.
>
> Same problem here; I just removed reduce_dict entirely.

Got it, thanks. I'll give that approach a try as well.

commented

When a batch contains only no-object images and the predicted boxes are all empty, src_boxes and target_boxes are both tensor([]). Then in the later loss computation,

loss = F.binary_cross_entropy_with_logits(src_logits, target_score, weight=weight, reduction='none')

this loss is an empty tensor of shape (batchsize, 0, num_classes + 1), and mean(1).sum() on it becomes nan.
I added one line after it:

if torch.isnan(loss): return {'loss_vfl': torch.tensor(0., device=loss.device, requires_grad=False)}

With this change I trained for one epoch on the same data; the hang did not occur, and the loss decreased over the epoch. But I'm not very confident in this fix: (1) it means one GPU can differ from the others and skip updating part of the network, which doesn't fit distributed training well; in my setup, all GPUs' weights should stay identical, meaning the gradients at each update should match across GPUs; (2) semantically, 0 and nan are quite far apart.
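The nan itself comes from averaging over a size-0 dimension (an empty mean is 0/0). A stdlib-only sketch of the same phenomenon and of the zero-substitution guard described above (`empty_mean` and `safe_vfl_loss` are hypothetical names of mine; in the repo the loss is a CUDA tensor, not a Python float):

```python
import math

def empty_mean(values):
    """Mean of a list; an empty input yields nan, mirroring what
    tensor.mean() does when reducing over a size-0 dimension."""
    n = len(values)
    return sum(values) / n if n else float('nan')

def safe_vfl_loss(loss):
    """Replace a nan loss with 0.0, like the one-line guard above.
    Caveat from the discussion: a zero loss skips this rank's gradient
    contribution, so ranks can momentarily disagree on updates."""
    return 0.0 if math.isnan(loss) else loss

loss = empty_mean([])          # no matched boxes in the batch
print(math.isnan(loss))        # True: this is what poisons the total loss
print(safe_vfl_loss(loss))     # 0.0
print(safe_vfl_loss(2.39))     # 2.39, finite losses pass through unchanged
```

An alternative sometimes used for this situation is returning `src_logits.sum() * 0.0` instead of a detached zero, which keeps the graph connected so DDP still synchronizes gradients for the involved parameters; I have not verified it against this repo.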

commented

I ran the earlier data again; the results are close to single-GPU training (no anomalies appeared). Thanks again for the help.