train PNet is so slow

Question

tzhang2014 opened this issue 7 years ago · comments

When I run python example/train_P_net.py --gpus 0 on a GTX 1070, training is very slow:
INFO:root:Epoch[0] Batch [200] Speed: 123.25 samples/sec Train-Accuracy=0.697969
INFO:root:Epoch[0] Batch [200] Speed: 123.25 samples/sec Train-LogLoss=0.617246
INFO:root:Epoch[0] Batch [200] Speed: 123.25 samples/sec Train-BBOX_MSE=0.103584
Can you help me? Is something wrong? Where is the mistake? Thanks.

xiaoxiongli · Answer 1 · Mon Feb 05 2018 16:03:35 GMT+0800 (China Standard Time)

You need to put your data on an SSD.

zhangtian · Answer 2 · Mon Feb 05 2018 21:32:23 GMT+0800 (China Standard Time)

@xiaoxiongli Thank you. How long does training take on your PC, and what is your PC's configuration? Thanks.

Linson Wang · Answer 3 · Tue Apr 24 2018 14:51:47 GMT+0800 (China Standard Time)

@tzhang2014 I have the same problem. How did you fix it?

INFO:root:Epoch[0] Batch [200] Speed: 126.56 samples/sec Train-Accuracy=0.697195
INFO:root:Epoch[0] Batch [200] Speed: 126.56 samples/sec Train-LogLoss=0.614800
INFO:root:Epoch[0] Batch [200] Speed: 126.56 samples/sec Train-BBOX_MSE=0.106309

Linson Wang · Answer 4 · Tue Apr 24 2018 17:22:15 GMT+0800 (China Standard Time)

Only the first epoch is slow; the later ones are very fast.

Qidian213 · Answer 5 · Fri Apr 27 2018 21:10:29 GMT+0800 (China Standard Time)

You can set MXNet's environment variables to speed up training, for example: export MXNET_GPU_WORKER_NTHREADS=4 (default = 2) and export MXNET_GPU_COPY_NTHREADS=4 (default = 1). After I did this, everything got better.

e.g. on an i7-7700 with a GTX 1060:
INFO:root:Epoch[0] Batch [3780] Speed: 8343.78 samples/sec Accuracy=0.898810 LogLoss=0.270442 BBOX_MSE=0.015827
INFO:root:Epoch[0] Batch [3800] Speed: 9112.26 samples/sec Accuracy=0.891901 LogLoss=0.282063 BBOX_MSE=0.015802
INFO:root:Epoch[0] Batch [3820] Speed: 10172.07 samples/sec Accuracy=0.883745 LogLoss=0.303172 BBOX_MSE=0.015691
INFO:root:Epoch[0] Batch [3840] Speed: 10388.03 samples/sec Accuracy=0.878459 LogLoss=0.288958 BBOX_MSE=0.015310
INFO:root:Epoch[0] Batch [3860] Speed: 9720.13 samples/sec Accuracy=0.885983 LogLoss=0.310603 BBOX_MSE=0.015680
INFO:root:Epoch[0] Batch [3880] Speed: 9980.33 samples/sec Accuracy=0.879565 LogLoss=0.300225 BBOX_MSE=0.016198
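
The two exports above can also be applied from Python. The following is a minimal sketch, not part of the original training script: the helper name configure_mxnet_threads is hypothetical, the values 4 and 4 are simply the ones quoted in this answer, and it assumes the variables have to be set before mxnet is imported in order to take effect.

```python
import os

# Values quoted in the answer above; tune them for your own machine.
MXNET_THREAD_SETTINGS = {
    "MXNET_GPU_WORKER_NTHREADS": "4",
    "MXNET_GPU_COPY_NTHREADS": "4",
}


def configure_mxnet_threads(settings=MXNET_THREAD_SETTINGS):
    """Set each variable unless the shell has already exported it."""
    for name, value in settings.items():
        os.environ.setdefault(name, value)
    # Report the values that are actually in effect.
    return {name: os.environ[name] for name in settings}


applied = configure_mxnet_threads()
print(applied)

# Assumption: import mxnet only after the variables are set.
# import mxnet as mx
```

Using setdefault means a value exported in the shell still wins, so the same script can be benchmarked with different thread counts without editing it.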

zhangtian · Answer 6 · Wed Jun 06 2018 14:53:04 GMT+0800 (China Standard Time)

@linsoncvw After the first epoch, training is very fast. I don't understand why.

Geoff · Answer 7 · Thu Jun 14 2018 10:45:23 GMT+0800 (China Standard Time)

Did you run into "Cannot find argument 'out_grad'" when using train_P_net.py?

EmiPark · Answer 8 · Tue Jul 03 2018 14:49:20 GMT+0800 (China Standard Time)

@geoffzhang I ran into the same problem. Did you fix it?

左庆 · Answer 9 · Wed Oct 10 2018 14:02:07 GMT+0800 (China Standard Time)

@geoffzhang @EmiPark Delete every 'out_grad=True' in core\symbol.py.
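
If there are many occurrences, the deletion can be scripted. This is only a rough sketch: strip_out_grad is a hypothetical helper, the sample line is made up for illustration and not copied from the real core\symbol.py, and the regular expressions assume the argument is always written as a plain keyword argument.

```python
import re


def strip_out_grad(source: str) -> str:
    """Remove every `out_grad=True` keyword argument from Python source text."""
    # Case 1: the argument is followed by a comma (first or middle position).
    source = re.sub(r"\bout_grad\s*=\s*True\s*,\s*", "", source)
    # Case 2: the argument comes last, so drop the comma in front of it.
    source = re.sub(r",\s*out_grad\s*=\s*True", "", source)
    # Case 3: it is the only argument in the call.
    return re.sub(r"\bout_grad\s*=\s*True", "", source)


# Hypothetical line for illustration only.
sample = "cls_prob = SoftmaxOutput(data=conv, label=label, out_grad=True, name='cls_prob')"
cleaned = strip_out_grad(sample)
print(cleaned)
```

To apply it to the real file, read the file's text, pass it through strip_out_grad, and write the result back, keeping a backup copy so the change can be compared or reverted.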

崔勇 · Answer 10 · Thu Sep 05 2019 11:45:00 GMT+0800 (China Standard Time)

Quoting: "@geoffzhang @EmiPark delete all 'out_grad=True' in core\symbol.py"
Does deleting 'out_grad=True' have any impact on training?