JDAI-CV / centerX

This repo is implemented based on detectron2 and CenterNet.

KD training

gneworld opened this issue · comments

Hi, I want to run KD training with yamls/coco/centernet_res18_KD.yaml, but I got the error "exp_results/coco/coco_exp_R50_SGD_0.5/model_final.pth not found!". How do I get this teacher model? Thanks very much.

You can train a ResNet-50 model and a ResNet-18 model first,
and then use the KD yaml to train.
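
For context, here is a minimal sketch of what a KD setup like this typically wires together: a frozen teacher (the trained ResNet-50 checkpoint the yaml points to) and a student trained against a blend of the usual task loss and a distillation loss. This is a generic illustration, not centerX's actual implementation; `kd_loss` and its parameters are hypothetical names.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Generic knowledge-distillation loss (hypothetical helper, not centerX's API)."""
    # Hard-label term: standard cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, targets)
    # Soft-label term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescales the soft-term gradients to a comparable magnitude
    return alpha * hard + (1.0 - alpha) * soft

# The teacher is the trained R50 checkpoint, loaded and frozen before KD training:
#   teacher.load_state_dict(torch.load("exp_results/coco/coco_exp_R50_SGD_0.5/model_final.pth")["model"])
#   teacher.eval()
```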

I have trained a ResNet-50 model with yamls/coco/centernet_res50_coco_0.5.yaml, but I only got "exp_results/coco/coco_exp_R50_SGD_0.5/inference/instances_predictions.pth", not a model_final.pth file. What steps am I missing?

Have you finished your training? model_final.pth is only saved when the whole training run ends.
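
This matches detectron2's checkpointing behavior: periodic snapshots are written as model_NNNNNNN.pth every SOLVER.CHECKPOINT_PERIOD iterations, and model_final.pth is written once, at the last iteration. A minimal sketch of that logic (illustrative, mirroring fvcore's PeriodicCheckpointer; `save` stands in for the real checkpointer):

```python
# Why model_final.pth appears only at the very end of training
# (sketch of detectron2/fvcore PeriodicCheckpointer-style logic).
def maybe_checkpoint(iteration, max_iter, period, save):
    if (iteration + 1) % period == 0:
        save(f"model_{iteration:07d}")  # periodic snapshot, e.g. model_0144999.pth
    if iteration >= max_iter - 1:
        save("model_final")             # written exactly once, at the final iteration
```

If training is interrupted (as with the ^C in the log below), you can resume from the last periodic snapshot, but model_final.pth will not exist yet.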

[12/07 10:49:35 d2.utils.events]: eta: 3 days, 1:46:30 iter: 144619 total_loss: 6.227 loss_cls: 4.281 loss_box_wh: 1.738 loss_off_reg: 0.2514 time: 0.3063 data_time: 0.0710 lr: 0.01 max_mem: 4798M
[12/07 10:49:42 d2.utils.events]: eta: 3 days, 1:53:33 iter: 144639 total_loss: 6.227 loss_cls: 4.257 loss_box_wh: 1.671 loss_off_reg: 0.2595 time: 0.3063 data_time: 0.0763 lr: 0.01 max_mem: 4798M
[12/07 10:49:48 d2.utils.events]: eta: 3 days, 1:53:08 iter: 144659 total_loss: 6.302 loss_cls: 4.289 loss_box_wh: 1.86 loss_off_reg: 0.2583 time: 0.3063 data_time: 0.0554 lr: 0.01 max_mem: 4798M
^C[12/07 10:49:48 d2.engine.hooks]: Overall training speed: 144659 iterations in 12:18:32 (0.3063 s / it)
[12/07 10:49:48 d2.engine.hooks]: Total training time: 12:34:44 (0:16:11 on hooks)
[12/07 10:49:48 d2.utils.events]: eta: 3 days, 1:53:07 iter: 144661 total_loss: 6.351 loss_cls: 4.318 loss_box_wh: 1.863 loss_off_reg: 0.2583 time: 0.3063 data_time: 0.0610 lr: 0.01 max_mem: 4798M

I have trained for 140k iterations and still cannot get model_final.pth.

It seems you will get model_final.pth after 3 more days (the eta in your log).
It's strange that your loss is so large; what's your mAP now? It looks like you modified the batch size but didn't modify the base_lr accordingly. What's your yaml?
This is my log:
[screenshot of training log]

Thanks for your quick reply. I missed reducing the lr by the same factor as the batch size (from 64 down to 8).
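
For anyone hitting the same problem: this is the linear scaling rule. Assuming the stock yaml pairs batch size 64 with base_lr 0.01 (an assumption consistent with the lr: 0.01 shown in the log above), cutting the batch size to 8 means the learning rate should shrink by the same 8x factor:

```python
# Linear scaling rule: scale the learning rate with the batch size.
# The reference values are assumptions taken from this thread
# (lr: 0.01 in the log; batch size changed from 64 to 8).
ref_batch, ref_lr = 64, 0.01
new_batch = 8
scaled_lr = ref_lr * new_batch / ref_batch
print(scaled_lr)  # 0.00125 -> set SOLVER.BASE_LR to this value
```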