zhanggang001 / RefineMask

RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features (CVPR 2021)

Loss NaN on LVIS and COCO

niecongchong opened this issue

I use 8 GPUs to train the model, with the lr set to 0.02 following your config, but the loss is NaN.
When I set the lr to 0.0025, as you suggested in another issue, the loss is normal.
Can you give me some help? Thanks.


Please upload your training log for more details.


If you use 8 GPUs to train the model (1 image per GPU), set the lr to 0.01, following the linear scaling rule. However, this may not reproduce the reported results.
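
For reference, the linear scaling rule simply scales the base lr in proportion to the total batch size. A minimal sketch, assuming the usual mmdetection baseline of lr=0.02 for a total batch size of 16 (8 GPUs × 2 images per GPU); the function name is illustrative, not part of the codebase:

```python
# Linear scaling rule: lr grows/shrinks proportionally with the total batch size.
# Assumed baseline: lr = 0.02 at a total batch size of 16 (8 GPUs x 2 images/GPU).
def scaled_lr(base_lr=0.02, base_batch_size=16, num_gpus=8, samples_per_gpu=2):
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size

print(scaled_lr(num_gpus=8, samples_per_gpu=1))  # 0.01, matching the suggestion above
print(scaled_lr(num_gpus=8, samples_per_gpu=2))  # 0.02, as with samples_per_gpu=2 in the config below
```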

This works.
But why does the loss of RefineMask immediately become NaN when the lr changes?

I mean, why is the loss NaN when the lr differs from the value given by the linear scaling rule?
In my experience, changing the lr only affects the final performance; it does not cause the loss to become NaN.

```python
dataset_type = 'LVISV1Dataset'
data_root = 'Dataset/lvis_v1/'
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=1,
    train=dict(
        type='ClassBalancedDataset',
        oversample_thr=0.001,
        dataset=dict(
            type='LVISV1Dataset',
            ann_file='Dataset/lvis_v1/annotations/lvis_v1_train.json',
            img_prefix='Dataset/lvis_v1/',
            pipeline=[
                dict(type='LoadImageFromFile', to_float32=True),
                dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
                dict(
                    type='Resize',
                    img_scale=[(1600, 400), (1600, 1400)],
                    multiscale_mode='range',
                    keep_ratio=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='SegRescale', scale_factor=0.125),
                dict(type='DefaultFormatBundle'),
                dict(
                    type='Collect',
                    keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks'])
            ])),
    val=dict(
        type='LVISV1Dataset',
        ann_file='Dataset/lvis_v1/annotations/lvis_v1_val.json',
        img_prefix='Dataset/lvis_v1/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1600, 1400),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip', flip_ratio=0.5),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='LVISV1Dataset',
        ann_file='Dataset/lvis_v1/annotations/lvis_v1_val.json',
        img_prefix='Dataset/lvis_v1/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1600, 1400),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip', flip_ratio=0.5),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=20, metric=['bbox', 'segm'])
```

```
2021-08-31 16:43:59,624 - mmdet - INFO - Epoch [1][50/7674] lr: 1.978e-03, eta: 3 days, 15:43:43, time: 2.058, data_time: 0.870, memory: 11711, loss_rpn_cls: 351.2067, loss_rpn_bbox: 133.3206, loss_cls: nan, acc: 70.0557, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
2021-08-31 16:44:57,794 - mmdet - INFO - Epoch [1][100/7674] lr: 3.976e-03, eta: 2 days, 20:38:46, time: 1.164, data_time: 0.158, memory: 11711, loss_rpn_cls: 0.6590, loss_rpn_bbox: 0.1426, loss_cls: nan, acc: 0.0002, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
2021-08-31 16:45:56,030 - mmdet - INFO - Epoch [1][150/7674] lr: 5.974e-03, eta: 2 days, 14:16:34, time: 1.164, data_time: 0.090, memory: 11711, loss_rpn_cls: 0.6108, loss_rpn_bbox: 0.1433, loss_cls: nan, acc: 0.0000, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
2021-08-31 16:46:59,519 - mmdet - INFO - Epoch [1][200/7674] lr: 7.972e-03, eta: 2 days, 12:12:41, time: 1.270, data_time: 0.118, memory: 11711, loss_rpn_cls: 0.5552, loss_rpn_bbox: 0.1334, loss_cls: nan, acc: 0.0002, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
```


When training on LVIS, the losses easily become NaN; I actually don't know the exact reason.

I found that using gradient clipping avoids this problem, so I did not dig further into the cause.

optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))
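
For context, this setting makes mmdetection's OptimizerHook clip the gradient norm after each backward pass, which caps the size of any single update and keeps one bad batch from blowing the losses up to NaN. Below is a minimal standalone sketch of the underlying operation in plain PyTorch, with a toy module standing in for the detector; it is not the hook's actual code path:

```python
import torch

# Toy stand-in for the detector; any nn.Module with parameters works here.
model = torch.nn.Linear(10, 2)

loss = model(torch.randn(4, 10)).sum()
loss.backward()

# Equivalent of grad_clip=dict(max_norm=35, norm_type=2): rescale all gradients
# so their total L2 norm does not exceed 35 before the optimizer step.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=35, norm_type=2)
print(total_norm)  # gradient norm measured before clipping
```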

Thanks. I will try that (optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))).

It really works, thanks.