zhanggang001 / RefineMask

RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features (CVPR 2021)

Loss NaN on LVIS and COCO

niecongchong opened this issue

I use 8 GPUs to train the model, with the lr set to 0.02 following your config, but the loss is NaN.
When I set the lr to 0.0025, as you suggested in another issue, the loss is normal.
Can you give me some help? Thanks.


Please upload your training log for more details.


If you use 8 GPUs to train the model (1 image per GPU), set the lr to 0.01, following the linear scaling rule. However, this may not reproduce the reported results.
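
For reference, the linear scaling rule simply scales the base lr in proportion to the total batch size. A minimal sketch, assuming the usual mmdetection baseline of lr=0.02 for a total batch size of 16 (8 GPUs × 2 images per GPU); the function name is illustrative, not part of the codebase:

```python
# Linear scaling rule: lr grows/shrinks proportionally with the total batch size.
# Assumed baseline: lr = 0.02 at a total batch size of 16 (8 GPUs x 2 images/GPU).
def scaled_lr(base_lr=0.02, base_batch_size=16, num_gpus=8, samples_per_gpu=2):
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size

print(scaled_lr(num_gpus=8, samples_per_gpu=1))  # 0.01, matching the suggestion above
print(scaled_lr(num_gpus=8, samples_per_gpu=2))  # 0.02, as with samples_per_gpu=2 in the config below
```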

This works.
But why does the loss of RefineMask immediately become NaN when the lr changes?

I mean, why is the loss NaN when the lr differs from the value given by the linear scaling rule?
In my experience, changing the lr only affects the final performance; it does not cause the loss to become NaN.

```python
dataset_type = 'LVISV1Dataset'
data_root = 'Dataset/lvis_v1/'
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=1,
    train=dict(
        type='ClassBalancedDataset',
        oversample_thr=0.001,
        dataset=dict(
            type='LVISV1Dataset',
            ann_file='Dataset/lvis_v1/annotations/lvis_v1_train.json',
            img_prefix='Dataset/lvis_v1/',
            pipeline=[
                dict(type='LoadImageFromFile', to_float32=True),
                dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
                dict(
                    type='Resize',
                    img_scale=[(1600, 400), (1600, 1400)],
                    multiscale_mode='range',
                    keep_ratio=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='SegRescale', scale_factor=0.125),
                dict(type='DefaultFormatBundle'),
                dict(
                    type='Collect',
                    keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks'])
            ])),
    val=dict(
        type='LVISV1Dataset',
        ann_file='Dataset/lvis_v1/annotations/lvis_v1_val.json',
        img_prefix='Dataset/lvis_v1/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1600, 1400),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip', flip_ratio=0.5),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='LVISV1Dataset',
        ann_file='Dataset/lvis_v1/annotations/lvis_v1_val.json',
        img_prefix='Dataset/lvis_v1/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1600, 1400),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip', flip_ratio=0.5),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=20, metric=['bbox', 'segm'])
```

```
2021-08-31 16:43:59,624 - mmdet - INFO - Epoch [1][50/7674] lr: 1.978e-03, eta: 3 days, 15:43:43, time: 2.058, data_time: 0.870, memory: 11711, loss_rpn_cls: 351.2067, loss_rpn_bbox: 133.3206, loss_cls: nan, acc: 70.0557, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
2021-08-31 16:44:57,794 - mmdet - INFO - Epoch [1][100/7674] lr: 3.976e-03, eta: 2 days, 20:38:46, time: 1.164, data_time: 0.158, memory: 11711, loss_rpn_cls: 0.6590, loss_rpn_bbox: 0.1426, loss_cls: nan, acc: 0.0002, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
2021-08-31 16:45:56,030 - mmdet - INFO - Epoch [1][150/7674] lr: 5.974e-03, eta: 2 days, 14:16:34, time: 1.164, data_time: 0.090, memory: 11711, loss_rpn_cls: 0.6108, loss_rpn_bbox: 0.1433, loss_cls: nan, acc: 0.0000, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
2021-08-31 16:46:59,519 - mmdet - INFO - Epoch [1][200/7674] lr: 7.972e-03, eta: 2 days, 12:12:41, time: 1.270, data_time: 0.118, memory: 11711, loss_rpn_cls: 0.5552, loss_rpn_bbox: 0.1334, loss_cls: nan, acc: 0.0002, loss_bbox: nan, loss_mask: nan, loss_semantic: nan, loss: nan
```


When training on LVIS, the losses easily become NaN; I actually don't know the exact reason.

I found that using gradient clipping avoids this problem, so I did not dig further into the cause.

optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))
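
For context, this setting makes mmdetection's OptimizerHook clip the gradient norm after each backward pass, which caps the size of any single update and keeps one bad batch from blowing the losses up to NaN. Below is a minimal standalone sketch of the underlying operation in plain PyTorch, with a toy module standing in for the detector; it is not the hook's actual code path:

```python
import torch

# Toy stand-in for the detector; any nn.Module with parameters works here.
model = torch.nn.Linear(10, 2)

loss = model(torch.randn(4, 10)).sum()
loss.backward()

# Equivalent of grad_clip=dict(max_norm=35, norm_type=2): rescale all gradients
# so their total L2 norm does not exceed 35 before the optimizer step.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=35, norm_type=2)
print(total_norm)  # gradient norm measured before clipping
```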

Thanks. I will try that (optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))).

It really works, thanks.