kennymckormick / pyskl

A toolbox for skeleton-based action recognition.

train error

121649982 opened this issue · comments

First of all, thank you for your great work.
When I train with one class, I encounter a problem: the loss is always 0.

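For reference, a single-class setup would explain the zero loss by itself: with num_classes=1, the softmax over a single logit is always 1, so a standard cross-entropy loss is exactly -log(1) = 0 and no gradient flows, which matches the loss_cls: 0.0000 and grad_norm: 0.0000 entries in the log below. A minimal PyTorch sketch (assuming the cls_head uses the default cross-entropy loss):

import torch
import torch.nn.functional as F

# With one class, softmax over a single logit is always 1, so the
# cross-entropy loss is -log(1) = 0 regardless of the prediction,
# and the gradient (hence grad_norm) is 0 as well.
logits = torch.randn(4, 1)                 # batch of 4, num_classes=1
labels = torch.zeros(4, dtype=torch.long)  # the only possible label is 0
print(F.cross_entropy(logits, labels))     # tensor(0.)

Top-1 accuracy is also trivially 1.0 in this setting, so the log is internally consistent; the usual workaround is to add a negative/background class so that num_classes >= 2.
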
2023-09-07 14:27:17,951 - pyskl - INFO - Config: model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='C3D',
        in_channels=17,
        base_channels=32,
        num_stages=3,
        temporal_downsample=False),
    cls_head=dict(type='I3DHead', in_channels=256, num_classes=1, dropout=0.5),
    test_cfg=dict(average_clips='prob'))
dataset_type = 'PoseDataset'
ann_file = './data/nturgbd/train.pkl'
left_kp = [1, 3, 5, 7, 9, 11, 13, 15]
right_kp = [2, 4, 6, 8, 10, 12, 14, 16]
train_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(-1, 64)),
    dict(type='RandomResizedCrop', area_range=(0.56, 1.0)),
    dict(type='Resize', scale=(56, 56), keep_ratio=False),
    dict(
        type='Flip',
        flip_ratio=0.5,
        left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
        right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48, num_clips=1),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(64, 64), keep_ratio=False),
    dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(type='UniformSampleFrames', clip_len=48, num_clips=10),
    dict(type='PoseDecode'),
    dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
    dict(type='Resize', scale=(64, 64), keep_ratio=False),
    dict(
        type='GeneratePoseTarget',
        with_kp=True,
        with_limb=False,
        double=True,
        left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
        right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
    dict(type='FormatShape', input_format='NCTHW_Heatmap'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=32,
    workers_per_gpu=4,
    test_dataloader=dict(videos_per_gpu=1),
    train=dict(
        type='RepeatDataset',
        times=10,
        dataset=dict(
            type='PoseDataset',
            ann_file='./data/nturgbd/train.pkl',
            split='xsub_train',
            pipeline=[
                dict(type='UniformSampleFrames', clip_len=48),
                dict(type='PoseDecode'),
                dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
                dict(type='Resize', scale=(-1, 64)),
                dict(type='RandomResizedCrop', area_range=(0.56, 1.0)),
                dict(type='Resize', scale=(56, 56), keep_ratio=False),
                dict(
                    type='Flip',
                    flip_ratio=0.5,
                    left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
                    right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
                dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
                dict(type='FormatShape', input_format='NCTHW_Heatmap'),
                dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
                dict(type='ToTensor', keys=['imgs', 'label'])
            ])),
    val=dict(
        type='PoseDataset',
        ann_file='./data/nturgbd/train.pkl',
        split='xsub_val',
        pipeline=[
            dict(type='UniformSampleFrames', clip_len=48, num_clips=1),
            dict(type='PoseDecode'),
            dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
            dict(type='Resize', scale=(64, 64), keep_ratio=False),
            dict(type='GeneratePoseTarget', with_kp=True, with_limb=False),
            dict(type='FormatShape', input_format='NCTHW_Heatmap'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]),
    test=dict(
        type='PoseDataset',
        ann_file='./data/nturgbd/train.pkl',
        split='xsub_val',
        pipeline=[
            dict(type='UniformSampleFrames', clip_len=48, num_clips=10),
            dict(type='PoseDecode'),
            dict(type='PoseCompact', hw_ratio=1.0, allow_imgpad=True),
            dict(type='Resize', scale=(64, 64), keep_ratio=False),
            dict(
                type='GeneratePoseTarget',
                with_kp=True,
                with_limb=False,
                double=True,
                left_kp=[1, 3, 5, 7, 9, 11, 13, 15],
                right_kp=[2, 4, 6, 8, 10, 12, 14, 16]),
            dict(type='FormatShape', input_format='NCTHW_Heatmap'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]))
optimizer = dict(type='SGD', lr=0.4, momentum=0.9, weight_decay=0.0003)
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
lr_config = dict(policy='CosineAnnealing', by_epoch=False, min_lr=0)
total_epochs = 24
checkpoint_config = dict(interval=1)
evaluation = dict(
    interval=1, metrics=['top_k_accuracy', 'mean_class_accuracy'], topk=(1, 5))
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
log_level = 'INFO'
work_dir = './work_dirs/posec3d/c3d_light_ntu60_xsub/joint'
dist_params = dict(backend='nccl')
gpu_ids = range(0, 1)

2023-09-07 14:27:17,951 - pyskl - INFO - Set random seed to 1045533513, deterministic: False
2023-09-07 14:27:18,009 - pyskl - INFO - 704 videos remain after valid thresholding
fatal: not a git repository (or any of the parent directories): .git
2023-09-07 14:27:19,134 - pyskl - INFO - Start running, host: lhc@lhc, work_dir: /home/lhc/gc8/pyskl-main/work_dirs/posec3d/c3d_light_ntu60_xsub/joint
2023-09-07 14:27:19,134 - pyskl - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook

before_train_epoch:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook

before_train_iter:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(LOW ) IterTimerHook

after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook

after_train_epoch:
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook

before_val_epoch:
(NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook

before_val_iter:
(LOW ) IterTimerHook

after_val_iter:
(LOW ) IterTimerHook

after_val_epoch:
(VERY_LOW ) TextLoggerHook

after_run:
(VERY_LOW ) TextLoggerHook

2023-09-07 14:27:19,134 - pyskl - INFO - workflow: [('train', 1)], max: 24 epochs
2023-09-07 14:27:19,134 - pyskl - INFO - Checkpoints will be saved to /home/lhc/gc8/pyskl-main/work_dirs/posec3d/c3d_light_ntu60_xsub/joint by HardDiskBackend.
[W reducer.cpp:1298] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
2023-09-07 14:27:34,668 - pyskl - INFO - Epoch [1][20/220] lr: 4.000e-01, eta: 1:08:05, time: 0.777, data_time: 0.446, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:39,113 - pyskl - INFO - Epoch [1][40/220] lr: 3.999e-01, eta: 0:43:37, time: 0.222, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:43,633 - pyskl - INFO - Epoch [1][60/220] lr: 3.999e-01, eta: 0:35:31, time: 0.226, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:48,048 - pyskl - INFO - Epoch [1][80/220] lr: 3.998e-01, eta: 0:31:19, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:52,572 - pyskl - INFO - Epoch [1][100/220] lr: 3.997e-01, eta: 0:28:52, time: 0.226, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:27:56,953 - pyskl - INFO - Epoch [1][120/220] lr: 3.995e-01, eta: 0:27:06, time: 0.219, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:01,348 - pyskl - INFO - Epoch [1][140/220] lr: 3.993e-01, eta: 0:25:49, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:05,744 - pyskl - INFO - Epoch [1][160/220] lr: 3.991e-01, eta: 0:24:51, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:10,142 - pyskl - INFO - Epoch [1][180/220] lr: 3.989e-01, eta: 0:24:05, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:14,541 - pyskl - INFO - Epoch [1][200/220] lr: 3.986e-01, eta: 0:23:27, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:18,940 - pyskl - INFO - Epoch [1][220/220] lr: 3.983e-01, eta: 0:22:55, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:19,302 - pyskl - INFO - Saving checkpoint at 1 epochs
2023-09-07 14:28:32,316 - pyskl - INFO - Epoch [2][20/220] lr: 3.980e-01, eta: 0:25:28, time: 0.649, data_time: 0.427, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:36,710 - pyskl - INFO - Epoch [2][40/220] lr: 3.976e-01, eta: 0:24:49, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:41,117 - pyskl - INFO - Epoch [2][60/220] lr: 3.973e-01, eta: 0:24:16, time: 0.220, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:45,533 - pyskl - INFO - Epoch [2][80/220] lr: 3.968e-01, eta: 0:23:47, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:49,956 - pyskl - INFO - Epoch [2][100/220] lr: 3.964e-01, eta: 0:23:21, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:54,377 - pyskl - INFO - Epoch [2][120/220] lr: 3.959e-01, eta: 0:22:57, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:28:58,798 - pyskl - INFO - Epoch [2][140/220] lr: 3.955e-01, eta: 0:22:36, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:03,223 - pyskl - INFO - Epoch [2][160/220] lr: 3.949e-01, eta: 0:22:16, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:07,646 - pyskl - INFO - Epoch [2][180/220] lr: 3.944e-01, eta: 0:21:58, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000
2023-09-07 14:29:12,070 - pyskl - INFO - Epoch [2][200/220] lr: 3.938e-01, eta: 0:21:42, time: 0.221, data_time: 0.000, memory: 3439, top1_acc: 1.0000, top5_acc: 1.0000, loss_cls: 0.0000, loss: 0.0000, grad_norm: 0.0000

Hello, did you manage to run pyskl on a single GPU? I am trying to do so, but I am encountering some errors!

Yes, I have trained on a single GPU.
What error did you encounter?

I proceeded by removing the MMDistributedDataParallel wrapper and the distributed training hooks such as DistSamplerSeedHook in tools/train.py.
The problem is that I am now getting the following error:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
I don't know if there are any changes that need to be made in scripts other than tools/train.py.
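
For what it's worth, that error usually means the model weights are on the GPU while the input batch is still on the CPU. In mmcv-based codebases the usual non-distributed path wraps the model in MMDataParallel, which scatters each batch onto the GPU before the forward pass. A rough sketch of that kind of change (the exact variable names and location in pyskl's tools/train.py may differ, so treat this as an assumption, not the project's official single-GPU recipe):

from mmcv.parallel import MMDataParallel

# Wrap the model for single-GPU, non-distributed training: model.cuda()
# moves the weights to the GPU, and MMDataParallel scatters each batch
# onto the same device, so inputs and weights no longer mismatch.
model = MMDataParallel(model.cuda(), device_ids=[0])

Depending on the codebase, the dataloader may also need a non-distributed sampler.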

What exactly did you change to make it work on a single GPU?
Thanks in advance.