VDIGPKU / DynamicDet

[CVPR 2023] DynamicDet: A Unified Dynamic Architecture for Object Detection

Error when running python train_step2.py

LeonNerd opened this issue

Hello, I ran into the following error while training on my own data.
Running python train_step2.py fails with:
Traceback (most recent call last):
  File "train_step2.py", line 551, in <module>
    train(hyp, opt, device, tb_writer)
  File "train_step2.py", line 160, in train
    optimizer.load_state_dict(ckpt['optimizer'])
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 141, in load_state_dict
    raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups
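The error occurs when loading the model produced by train_step1.py.
The command was: python train_step2.py --weight runs/train/exp5/weights/last.pt --name dy-yolov7-step2
Tracing the code, I found that pg0 is empty, so the new optimizer's state dict has only 2 parameter groups, while the checkpoint saved by train_step1.py has 3: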
pg0, pg1, pg2 = [], [], []  # optimizer parameter groups
for k, v in model.named_modules():
    if 'router' in k:
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):
            pg2.append(v.bias)  # biases
        if isinstance(v, nn.BatchNorm2d):
            pg0.append(v.weight)  # no decay
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):
            pg1.append(v.weight)  # apply decay

if opt.adam:
    optimizer = optim.AdamW(pg1, lr=hyp['lr0'], weight_decay=hyp['weight_decay'], betas=(hyp['momentum'], 0.999))  # adjust beta1 to momentum
else:
    optimizer = optim.SGD(pg1, lr=hyp['lr0'], weight_decay=hyp['weight_decay'], momentum=hyp['momentum'], nesterov=True)
if len(pg0):
    optimizer.add_param_group({'params': pg0, 'weight_decay': 0})  # add pg0 without weight_decay
if len(pg2):
    optimizer.add_param_group({'params': pg2, 'weight_decay': 0})  # add pg2 (biases)
logger.info('Optimizer groups: %g .bias, %g conv.weight, %g other' % (len(pg2), len(pg1), len(pg0)))
del pg0, pg1, pg2

Thank you!

Hello! We are honored that you are trying out our work!

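When you move from the first training stage to the second, there is no need to load the optimizer state_dict. The two stages optimize different parameters (the detector in stage one, the router in stage two), so the saved optimizer state does not match the stage-two optimizer.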
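
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the DynamicDet code: the modules `layer`, `bn`, and `router` are hypothetical stand-ins, chosen only so that one optimizer ends up with three parameter groups and the other with two.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-ins for the stage-one (detector) parameters.
layer = nn.Linear(4, 2)
bn = nn.BatchNorm1d(2)

# "Stage one" optimizer: three parameter groups (weights, BN weights, biases).
stage1 = optim.SGD([layer.weight], lr=0.01, momentum=0.9)
stage1.add_param_group({'params': [bn.weight], 'weight_decay': 0})
stage1.add_param_group({'params': [layer.bias], 'weight_decay': 0})

# "Stage two" optimizer: only two groups, because no BatchNorm weights
# were collected (the analogue of an empty pg0).
router = nn.Linear(4, 1)
stage2 = optim.SGD([router.weight], lr=0.01, momentum=0.9)
stage2.add_param_group({'params': [router.bias], 'weight_decay': 0})

saved = stage1.state_dict()
try:
    stage2.load_state_dict(saved)  # group counts differ: 3 vs 2
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(len(saved['param_groups']), len(stage2.param_groups))
print(mismatch_error)
```

Comparing `len(ckpt['optimizer']['param_groups'])` with `len(optimizer.param_groups)` before loading is a quick way to confirm this kind of mismatch in a real checkpoint.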

For this reason, once stage-one training has finished, both the last and best weights are cleaned up and stripped (i.e. Link).

You can try calling strip_optimizer on your weights and then start the second training stage. After that, ckpt['optimizer'] will be None and the load will be skipped.
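
The effect of that step can be sketched with plain dictionaries. This is an illustration under assumptions, not the repository's strip_optimizer: `drop_optimizer_state`, `maybe_load_optimizer`, and `StubOptimizer` are hypothetical names, and the checkpoint is modeled as an ordinary dict with an 'optimizer' key.

```python
class StubOptimizer:
    """Stand-in for a torch.optim optimizer; records any state it is given."""

    def __init__(self):
        self.loaded_state = None

    def load_state_dict(self, state):
        self.loaded_state = state


def drop_optimizer_state(ckpt):
    """Return a shallow copy of ckpt with the optimizer entry set to None."""
    cleaned = dict(ckpt)
    cleaned['optimizer'] = None
    return cleaned


def maybe_load_optimizer(optimizer, ckpt):
    """Load optimizer state only if the checkpoint carries one."""
    state = ckpt.get('optimizer')
    if state is None:
        return False  # stripped checkpoint: nothing to load
    optimizer.load_state_dict(state)
    return True


# A checkpoint from an interrupted stage-one run still carries optimizer state.
stage1_ckpt = {'epoch': 99, 'optimizer': {'param_groups': [{}, {}, {}]}}
stripped = drop_optimizer_state(stage1_ckpt)

opt = StubOptimizer()
loaded = maybe_load_optimizer(opt, stripped)
print(loaded)  # False
```

With the optimizer entry set to None, the guarded load is a no-op, so the stage-two optimizer starts from a fresh state regardless of how many parameter groups stage one used.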

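Thank you very much for the prompt reply. The process was probably shut down, so the first stage never finished training. The problem is now solved: train_step2.py runs, and I can continue validating the rest of the pipeline.
Thank you also for open-sourcing this work; it has given me new ideas for my current research. Thanks.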