zgcr / SimpleAICV_pytorch_training_examples

SimpleAICV: PyTorch training and testing examples.

train on RetinaNet

blankspace415 opened this issue · comments

When I train without apex, the output is as follows:
loading annotations into memory...
Done (t=20.29s)
creating index...
index created!
loading annotations into memory...
Done (t=2.85s)
creating index...
index created!
If I do use apex, it additionally prints:
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0

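After that, nothing else is printed. Is it training, or is it stuck? If it is stuck, what is causing it? My training environment is a 3080 Ti, with the batch size set to 2.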
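
For readers puzzled by the "Gradient overflow. Skipping step" lines in the log above: with `loss_scale : dynamic`, a skipped step followed by a halved loss scale is typically expected during the first iterations and does not by itself mean that training has hung. Below is a minimal sketch of that halving logic. It is a toy illustration, not apex's actual code: the class name `DynamicLossScaler`, its parameters, and the initial scale of 65536 (chosen because it is consistent with the first logged value of 32768.0) are all assumptions.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical; not the apex implementation)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=2000):
        self.scale = init_scale   # current loss scale
        self.factor = factor      # divide/multiply the scale by this amount
        self.window = window      # clean steps required before growing the scale
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should run, False if it is skipped."""
        overflow = any(not math.isfinite(g) for g in grads)
        if overflow:
            # inf/NaN gradient: skip this step and reduce the loss scale
            self.scale /= self.factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.window == 0:
            # a long run of clean steps: try a larger scale again
            self.scale *= self.factor
        return True


scaler = DynamicLossScaler()
history = []
for grads in [[float("inf")], [float("nan"), 1.0], [0.5, 0.25]]:
    stepped = scaler.update(grads)
    history.append((stepped, scaler.scale))

print(history)
```

The first two simulated iterations overflow, so the step is skipped and the scale is halved each time. The third has finite gradients, so the step runs and the scale is left unchanged.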

commented

It looks like you have already figured out what state this output represents.

So what is actually going on here? I have run into the same problem!

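@Joejwu After you start the run, a file is generated in the root directory, and the training information is updated in real time in a txt file inside it (I don't quite remember the exact file type). See the code in train.py for details.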
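
One way to watch such a log while training runs is to follow the file as it grows. The sketch below is only an illustration: the helper name `follow`, its parameters, and the path in the usage comment are hypothetical, and the real log location has to be looked up in train.py.

```python
import time
from pathlib import Path


def follow(path, poll_seconds=1.0, max_polls=None):
    """Yield lines from a text file, waiting for new lines to be appended.

    Stops after `max_polls` idle polls; with max_polls=None it runs until interrupted.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        polls = 0
        while max_polls is None or polls < max_polls:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                # nothing new yet: wait and check again
                polls += 1
                time.sleep(poll_seconds)


# Usage (hypothetical path; replace it with the log file that train.py writes):
# for line in follow("path/to/training_log.txt"):
#     print(line)
```

If new lines keep arriving in the log file, training is progressing even though the console stays quiet.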

Got it! Thanks for the tip! I'll go take another look!