zgcr / SimpleAICV_pytorch_training_examples

SimpleAICV: PyTorch training and testing examples.

RetinaNet training issue

Joejwu opened this issue · comments

Hi, I'd like to ask: why does training RetinaNet always stop on its own after running for a while and printing a few warnings, without reporting any error? At first I thought it was a problem with apex, but after setting it to false it still stopped by itself; later I switched to multi-GPU training and got the same result. Do you know why?
```
root@container-ab78119f3c-c31dcd5b:~/SimpleAICV-pytorch-ImageNet-COCO-training-master/detection_training/coco/res50_retinanet_retinaresize800# sh train.sh
======================1======================
No pretrained model file!
loading annotations into memory...
Done (t=16.43s)
creating index...
index created!
Dataset Size:117266
Dataset Class Num:80
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
Dataset Size:5000
Dataset Class Num:80
======================2======================
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
```

Found the cause, I'm closing this!

commented

hi, glad to hear you've already solved the problem. Let me add something: when using apex, if you see output like the following:
```
======================2======================
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
```

the model is already training at this point; it's just that with apex the log is not refreshed in the terminal window. All logs are written to log/train.info.log. Because print_interval = 100 in train_config.py, train.info.log is only updated once every 100 batches. If you want to see log output quickly, you can set this value to 1.
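The interval-based logging described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the logger name, the log format, and the `train_loop` helper are assumptions; only the `print_interval` behavior and the log/train.info.log destination come from the comment above.

```python
import logging


def make_logger(log_path):
    # File logger similar in spirit to the repo's train.info.log
    # (the logger name and format here are assumptions).
    logger = logging.getLogger("train.info")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def train_loop(logger, num_batches, print_interval=100):
    # Hypothetical training loop: only every `print_interval`-th batch
    # writes a log line, so the file can look "stuck" between updates.
    for batch_idx in range(1, num_batches + 1):
        # ... forward / backward / optimizer step would happen here ...
        if batch_idx % print_interval == 0:
            logger.info(f"batch {batch_idx}/{num_batches}: loss=...")
```

Setting `print_interval=1` makes every batch produce a line, which is the quick-feedback option the maintainer suggests.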

commented

One more note: whether or not apex is used, log/train.info.log records all training logs. Generally speaking, as long as the terminal command has not been interrupted, training is proceeding normally.

Got it! Thanks a lot!

Hi, I'd also like to ask about the correspondence between batch size and learning rate. I see that mmdetection has an automatic learning-rate scaling feature:
_For example: with lr=0.01 on 4 GPUs with 2 images per GPU, the LR is automatically scaled to lr=0.08 on 16 GPUs with 4 images per GPU._
So when training RetinaNet with your open-source code, is there also such a correspondence between learning rate and batch size? The GPU I'm using only allows a batch size of 2 or 4, and I'm not sure whether your previous default learning rate of 0.0001 is still appropriate. On top of that, a full training run takes far too long for me to simply try it out, so I'm bothering you here.

commented

At the moment you need to tune the learning rate yourself.
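For reference, the linear scaling rule that the question cites from mmdetection can be sketched as below. Whether it is a good starting point here depends on what total batch size the repo's default lr of 0.0001 was tuned for, which is not stated in this thread, so any `base_total_batch_size` you plug in for this repo is your own assumption; manual tuning, as the maintainer says, is still required.

```python
def scale_lr(base_lr, base_total_batch_size, new_total_batch_size):
    """Linear scaling rule: the learning rate scales proportionally
    with the total batch size (num_gpus * images_per_gpu)."""
    return base_lr * new_total_batch_size / base_total_batch_size


# mmdetection's example from the question above:
# lr=0.01 at 4 GPUs x 2 imgs (total 8) -> 16 GPUs x 4 imgs (total 64)
print(scale_lr(0.01, 4 * 2, 16 * 4))  # → 0.08
```

Conversely, shrinking the total batch size (e.g. to 2 or 4 on a single GPU) would shrink the lr by the same factor under this rule.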

Okay! Got it! Thank you!