bharatsingh430 / py-R-FCN-multiGPU

Code for training py-faster-rcnn and py-R-FCN on multiple GPUs in caffe


Do I need to modify the learning rate when several GPUs are used?

zengarden opened this issue

Hi,
In Caffe, the loss is averaged over iter_size (as in batch-accumulated training). Is the loss also averaged in multi-GPU training (e.g., by the number of GPUs used)? If not, the learning rate should stay the same as the lr used with a single GPU. Am I right?

Best,
jemmy li

You need to increase the learning rate when you increase the number of GPUs.

Thanks. To be more concrete: if I use 8 GPUs, should the lr be 8x the one used with 1 GPU (same iter_size)?

That worked for me, but it may not always be true.
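To make the rule of thumb above concrete, here is a rough Python sketch of the linear scaling rule (an assumption, not the repo's official recipe): it treats the effective batch size as ims_per_batch * iter_size * num_gpus and scales the base learning rate by the same factor. Whether this holds exactly depends on how the loss and gradients are normalized in your Caffe build.

```python
# Rough sketch of the linear LR scaling rule discussed above (an assumption,
# not this repo's official recipe): if the effective batch size grows with
# the number of GPUs, scale the base learning rate by the same factor.

def effective_batch_size(ims_per_batch, iter_size, num_gpus):
    """Images consumed per solver update across all GPUs."""
    return ims_per_batch * iter_size * num_gpus

def scaled_lr(base_lr, base_gpus, num_gpus):
    """Scale the learning rate linearly with the number of GPUs,
    keeping iter_size and per-GPU batch size fixed."""
    return base_lr * num_gpus / base_gpus

# Example: a 1-GPU schedule with lr = 1e-3 would become 8e-3 on 8 GPUs,
# which matches the 0.008 used for the 8-GPU COCO schedule mentioned below.
print(effective_batch_size(ims_per_batch=1, iter_size=2, num_gpus=8))  # 16
print(scaled_lr(1e-3, base_gpus=1, num_gpus=8))                        # 0.008
```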

Got it. In your COCO branch, it seems that the lr is still set to 1e-3 for training, while the stepsize has been set to 90000. I mean the settings in models/coco/ResNet-101/rfcn_end2end/solver_ohem.prototxt.

I created this repo for multi-GPU training and it was meant for 2 GPUs with iter_size 1 on PASCAL. But I suppose the step-down would be too early for COCO in that case. I probably did not optimize the parameters for COCO when I created this repo.

The soft-nms repo contains the training schedule for MS-COCO which gets 35.1 mAP, where the lr is set to 0.008. But again, it's dataset specific and specific to 8 GPUs.

I'll also update this repo in a month or so, so that master has all the features.

Awesome soft-nms repo. R-FCN in this repo got 30.8%, while the soft-nms repo got 33.9%. I see that one difference between them is the test set: COCO 2014 vs. 2015 minival (but I think the 2015 minival is the same as the 2014 minival). Another difference is the PSRoIPooling: soft-nms uses aligned PSRoIPooling (proposed in Mask R-CNN). Does aligned PSRoIPooling account for the 3.1% improvement? I would like to reproduce the results given in soft-nms.

It is not completely due to Mask R-CNN's RoI Align. I implemented what I could understand from the paper, and I saw around a 1% improvement from fixing the alignment issue. I also reduced the RPN min size from 32 to 16. Training was done for 160k iterations; probably training longer would help more. In my experience, test-dev gives 0.2% more for R-FCN, so you should get 35.3 on test-dev.
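For anyone trying to reproduce the alignment fix, here is a minimal NumPy sketch of the idea behind "aligned" position-sensitive ROI pooling: keep the ROI and bin coordinates fractional and sample the score maps with bilinear interpolation instead of rounding to integer cells. It is only an illustration under assumed shapes and a bin-major channel layout; the actual CUDA layer in the soft-nms repo may order channels and sample differently.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat[:, y, x] at fractional (y, x)."""
    H, W = feat.shape[-2:]
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y0, x0 = min(max(y0, 0), H - 1), min(max(x0, 0), W - 1)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    ly, lx = np.clip(y - y0, 0.0, 1.0), np.clip(x - x0, 0.0, 1.0)
    return ((1 - ly) * (1 - lx) * feat[..., y0, x0] +
            (1 - ly) * lx       * feat[..., y0, x1] +
            ly       * (1 - lx) * feat[..., y1, x0] +
            ly       * lx       * feat[..., y1, x1])

def psroi_align(score_maps, roi, k=7, num_classes=21,
                spatial_scale=1.0 / 16, samples=2):
    """Position-sensitive ROI pooling without coordinate rounding.

    score_maps: (k*k*num_classes, H, W) array, bin-major channel layout
                (the Caffe PSROIPooling layer may order channels differently).
    roi: (x1, y1, x2, y2) in image coordinates.
    Returns (num_classes, k, k) pooled scores.
    """
    x1, y1, x2, y2 = [v * spatial_scale for v in roi]  # keep fractional coords
    bin_h, bin_w = (y2 - y1) / k, (x2 - x1) / k
    out = np.zeros((num_classes, k, k), dtype=score_maps.dtype)
    for i in range(k):              # bin row
        for j in range(k):          # bin column
            # channel group responsible for bin (i, j)
            group = score_maps[(i * k + j) * num_classes:
                               (i * k + j + 1) * num_classes]
            vals = []
            for sy in range(samples):   # regular sampling grid inside the bin
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(group, y, x))
            out[:, i, j] = np.mean(vals, axis=0)
    return out
```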

Thanks a lot.

I will try to reproduce the soft-nms experiments.

@bharatsingh430
@zengarden

Regarding "I also reduced the RPN min size from 32 to 16": does this refer to the parameters __C.TRAIN.RPN_MIN_SIZE and __C.TEST.RPN_MIN_SIZE? It looks like they went from 16 to 8, not from 32 to 16.

Am I right?
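For reference, both __C.TRAIN.RPN_MIN_SIZE and __C.TEST.RPN_MIN_SIZE live in the py-faster-rcnn-style lib/fast_rcnn/config.py, and overriding them looks roughly like the sketch below. The values shown are placeholders taken from the question above; the actual numbers used in the soft-nms branch should be read from its config files rather than from this snippet.

```python
# Hedged sketch (not taken from the soft-nms repo): overriding the RPN
# proposal minimum-size thresholds in a py-faster-rcnn / py-R-FCN style setup.
# Proposals whose width or height falls below this threshold (scaled by the
# image scale factor) are filtered out in the proposal layer.
from fast_rcnn.config import cfg

cfg.TRAIN.RPN_MIN_SIZE = 8   # placeholder value, per the question above
cfg.TEST.RPN_MIN_SIZE = 8    # placeholder value, per the question above
```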