bharatsingh430 / py-R-FCN-multiGPU

Code for training py-faster-rcnn and py-R-FCN on multiple GPUs in caffe


Do I need to modify the learning rate when several GPUs are used?

zengarden opened this issue

Hi,
In Caffe, the loss is averaged over iter_size (as in batch-accumulated training). Is the loss also averaged in multi-GPU training (e.g., by the number of GPUs used)? If not, the learning rate should stay the same as the lr used with a single GPU. Am I right?

Best,
jemmy li

You need to increase the learning rate when you increase the number of GPUs.

Thanks. To be more concrete: if I use 8 GPUs, should the lr be 8x the one used with 1 GPU (same iter_size)?

That worked for me, but it may not always be true.
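To make the rule of thumb above concrete, here is a rough Python sketch of the linear scaling rule (an assumption, not the repo's official recipe): it treats the effective batch size as ims_per_batch * iter_size * num_gpus and scales the base learning rate by the same factor. Whether this holds exactly depends on how the loss and gradients are normalized in your Caffe build.

```python
# Rough sketch of the linear LR scaling rule discussed above (an assumption,
# not this repo's official recipe): if the effective batch size grows with
# the number of GPUs, scale the base learning rate by the same factor.

def effective_batch_size(ims_per_batch, iter_size, num_gpus):
    """Images consumed per solver update across all GPUs."""
    return ims_per_batch * iter_size * num_gpus

def scaled_lr(base_lr, base_gpus, num_gpus):
    """Scale the learning rate linearly with the number of GPUs,
    keeping iter_size and per-GPU batch size fixed."""
    return base_lr * num_gpus / base_gpus

# Example: a 1-GPU schedule with lr = 1e-3 would become 8e-3 on 8 GPUs,
# which matches the 0.008 used for the 8-GPU COCO schedule mentioned below.
print(effective_batch_size(ims_per_batch=1, iter_size=2, num_gpus=8))  # 16
print(scaled_lr(1e-3, base_gpus=1, num_gpus=8))                        # 0.008
```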

Got it. In your COCO branch, it seems that the lr is still set to 1e-3 for training, while the stepsize has been set to 90000. I mean the settings in models/coco/ResNet-101/rfcn_end2end/solver_ohem.prototxt.

I created this repo for multi-GPU training and it was meant for 2 GPUs with iter_size 1 on PASCAL. But I suppose the step-down would be too early for COCO in that case. I probably did not optimize the parameters for COCO when I created this repo.

The soft-nms repo contains the training schedule for MS-COCO which gets 35.1 mAP, where the lr is set to 0.008. But again, it's dataset specific and specific to 8 GPUs.

I'll also update this repo in a month or so, so that master has all the features.

Awesome soft-nms repo. R-FCN in this repo got 30.8%, while the soft-nms repo got 33.9%. I see that one difference between them is the test set: COCO 2014 vs. 2015 minival (but I think the 2015 minival is the same as the 2014 minival). Another difference is the PSRoIPooling: soft-nms uses aligned PSRoIPooling (proposed in Mask R-CNN). Does aligned PSRoIPooling account for the 3.1% improvement? I would like to reproduce the results given in soft-nms.

It is not completely due to Mask R-CNN's RoI Align. I implemented what I could understand from the paper, and I saw around a 1% improvement from fixing the alignment issue. I also reduced the RPN min size from 32 to 16. Training was done for 160k iterations; probably training longer would help more. In my experience, test-dev gives 0.2% more for R-FCN, so you should get 35.3 on test-dev.
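For anyone trying to reproduce the alignment fix, here is a minimal NumPy sketch of the idea behind "aligned" position-sensitive ROI pooling: keep the ROI and bin coordinates fractional and sample the score maps with bilinear interpolation instead of rounding to integer cells. It is only an illustration under assumed shapes and a bin-major channel layout; the actual CUDA layer in the soft-nms repo may order channels and sample differently.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat[:, y, x] at fractional (y, x)."""
    H, W = feat.shape[-2:]
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y0, x0 = min(max(y0, 0), H - 1), min(max(x0, 0), W - 1)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    ly, lx = np.clip(y - y0, 0.0, 1.0), np.clip(x - x0, 0.0, 1.0)
    return ((1 - ly) * (1 - lx) * feat[..., y0, x0] +
            (1 - ly) * lx       * feat[..., y0, x1] +
            ly       * (1 - lx) * feat[..., y1, x0] +
            ly       * lx       * feat[..., y1, x1])

def psroi_align(score_maps, roi, k=7, num_classes=21,
                spatial_scale=1.0 / 16, samples=2):
    """Position-sensitive ROI pooling without coordinate rounding.

    score_maps: (k*k*num_classes, H, W) array, bin-major channel layout
                (the Caffe PSROIPooling layer may order channels differently).
    roi: (x1, y1, x2, y2) in image coordinates.
    Returns (num_classes, k, k) pooled scores.
    """
    x1, y1, x2, y2 = [v * spatial_scale for v in roi]  # keep fractional coords
    bin_h, bin_w = (y2 - y1) / k, (x2 - x1) / k
    out = np.zeros((num_classes, k, k), dtype=score_maps.dtype)
    for i in range(k):              # bin row
        for j in range(k):          # bin column
            # channel group responsible for bin (i, j)
            group = score_maps[(i * k + j) * num_classes:
                               (i * k + j + 1) * num_classes]
            vals = []
            for sy in range(samples):   # regular sampling grid inside the bin
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(group, y, x))
            out[:, i, j] = np.mean(vals, axis=0)
    return out
```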

Thanks a lot.

I will try to reproduce the soft-nms experiments.

@bharatsingh430
@zengarden

Regarding "I also reduced the RPN min size from 32 to 16": does this refer to the parameters __C.TRAIN.RPN_MIN_SIZE and __C.TEST.RPN_MIN_SIZE? It looks like they went from 16 to 8, not from 32 to 16.

Am I right?
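For reference, both __C.TRAIN.RPN_MIN_SIZE and __C.TEST.RPN_MIN_SIZE live in the py-faster-rcnn-style lib/fast_rcnn/config.py, and overriding them looks roughly like the sketch below. The values shown are placeholders taken from the question above; the actual numbers used in the soft-nms branch should be read from its config files rather than from this snippet.

```python
# Hedged sketch (not taken from the soft-nms repo): overriding the RPN
# proposal minimum-size thresholds in a py-faster-rcnn / py-R-FCN style setup.
# Proposals whose width or height falls below this threshold (scaled by the
# image scale factor) are filtered out in the proposal layer.
from fast_rcnn.config import cfg

cfg.TRAIN.RPN_MIN_SIZE = 8   # placeholder value, per the question above
cfg.TEST.RPN_MIN_SIZE = 8    # placeholder value, per the question above
```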