FloatingPointError: Loss became infinite or NaN at iteration=167!
kevinchow1993 opened this issue
dataset: coco2017
IMS_PER_BATCH: 2
[08/21 15:47:57 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[08/21 15:48:01 d2.data.common]: Serialized dataset takes 451.21 MiB
[08/21 15:48:01 d2.data.build]: Using training sampler TrainingSampler
[08/21 15:48:05 fvcore.common.checkpoint]: No checkpoint found. Initializing model from scratch
[08/21 15:48:05 d2.engine.train_loop]: Starting training from iteration 0
[08/21 15:48:06 d2.utils.events]: eta: 2:36:14 iter: 19 total_loss: 20.71 loss_cls: 18.63 loss_box_wh: 3.366 loss_center_reg: 0.3933 time: 0.0750 data_time: 0.0180 lr: 0.00039962 max_mem: 411M
[08/21 15:48:08 d2.utils.events]: eta: 2:39:49 iter: 39 total_loss: 12.1 loss_cls: 8.956 loss_box_wh: 1.814 loss_center_reg: 0.368 time: 0.0758 data_time: 0.0027 lr: 0.00079922 max_mem: 411M
[08/21 15:48:10 d2.utils.events]: eta: 2:40:21 iter: 59 total_loss: 10.76 loss_cls: 6.582 loss_box_wh: 3.11 loss_center_reg: 0.3979 time: 0.0760 data_time: 0.0024 lr: 0.0011988 max_mem: 411M
[08/21 15:48:11 d2.utils.events]: eta: 2:40:19 iter: 79 total_loss: 11.63 loss_cls: 7.729 loss_box_wh: 2.653 loss_center_reg: 0.2949 time: 0.0762 data_time: 0.0027 lr: 0.0015984 max_mem: 411M
[08/21 15:48:13 d2.utils.events]: eta: 2:39:15 iter: 99 total_loss: 11.51 loss_cls: 6.495 loss_box_wh: 2.633 loss_center_reg: 0.2932 time: 0.0757 data_time: 0.0027 lr: 0.001998 max_mem: 411M
[08/21 15:48:14 d2.utils.events]: eta: 2:39:43 iter: 119 total_loss: 17.47 loss_cls: 11.55 loss_box_wh: 2.588 loss_center_reg: 0.2923 time: 0.0757 data_time: 0.0025 lr: 0.0023976 max_mem: 411M
[08/21 15:48:16 d2.utils.events]: eta: 2:38:11 iter: 139 total_loss: 13.35 loss_cls: 8.074 loss_box_wh: 3.39 loss_center_reg: 0.267 time: 0.0755 data_time: 0.0024 lr: 0.0027972 max_mem: 411M
[08/21 15:48:17 d2.utils.events]: eta: 2:39:10 iter: 159 total_loss: 12.71 loss_cls: 8.765 loss_box_wh: 2.91 loss_center_reg: 0.2659 time: 0.0758 data_time: 0.0026 lr: 0.0031968 max_mem: 411M
ERROR [08/21 15:48:18 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/train_loop.py", line 141, in train
self.run_step()
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/train_loop.py", line 244, in run_step
self._detect_anomaly(losses, loss_dict)
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/train_loop.py", line 257, in _detect_anomaly
self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=167!
loss_dict = {'loss_cls': tensor(inf, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_wh': tensor(2.2988, device='cuda:0', grad_fn=<MulBackward0>), 'loss_center_reg': tensor(0.2414, device='cuda:0', grad_fn=<MulBackward0>), 'data_time': 0.0025475993752479553}
[08/21 15:48:18 d2.engine.hooks]: Overall training speed: 165 iterations in 0:00:12 (0.0762 s / it)
[08/21 15:48:18 d2.engine.hooks]: Total training time: 0:00:12 (0:00:00 on hooks)
Traceback (most recent call last):
File "train_net.py", line 67, in <module>
args=(args,),
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "train_net.py", line 55, in main
return trainer.train()
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/defaults.py", line 402, in train
super().train(self.start_iter, self.max_iter)
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/train_loop.py", line 141, in train
self.run_step()
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/train_loop.py", line 244, in run_step
self._detect_anomaly(losses, loss_dict)
File "/home/ma-user/work/Projects/CenterNet-better-plus/detectron2-master/detectron2/engine/train_loop.py", line 257, in _detect_anomaly
self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=167!
loss_dict = {'loss_cls': tensor(inf, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_wh': tensor(2.2988, device='cuda:0', grad_fn=<MulBackward0>), 'loss_center_reg': tensor(0.2414, device='cuda:0', grad_fn=<MulBackward0>), 'data_time': 0.0025475993752479553}
Why does the loss become infinite/NaN? Is something wrong? The only thing I changed was IMS_PER_BATCH from 128 to 2, because I don't have enough memory.
Try reducing the lr.
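The usual way to do that is the linear scaling rule: scale the learning rate in proportion to the batch size, so cutting IMS_PER_BATCH from 128 to 2 means BASE_LR should be roughly 64× smaller. Below is a minimal sketch, assuming the standard detectron2 config keys SOLVER.BASE_LR and SOLVER.IMS_PER_BATCH; the config file path is hypothetical and not the repo's actual default.

```python
# Minimal sketch (not the repo's training script): apply the linear scaling
# rule when shrinking the batch size from 128 to 2.
from detectron2.config import get_cfg

REFERENCE_BATCH = 128   # batch size the default BASE_LR was presumably tuned for
NEW_BATCH = 2           # batch size that actually fits in memory here

cfg = get_cfg()
# cfg.merge_from_file("configs/centernet_r50_coco.yaml")  # hypothetical config path

# Scale the learning rate in proportion to the batch-size reduction,
# then set the smaller batch size.
cfg.SOLVER.BASE_LR = cfg.SOLVER.BASE_LR * NEW_BATCH / REFERENCE_BATCH
cfg.SOLVER.IMS_PER_BATCH = NEW_BATCH
```

Even with the scaled LR, a batch of 2 on COCO is quite noisy, so a longer warmup (SOLVER.WARMUP_ITERS) may also help keep loss_cls from blowing up in the first few hundred iterations.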