Training does not converge for resnet50-ssd on pascal VOC dataset

Question

kristellmarisse opened this issue 8 years ago

I am training SSD-ResNet50 on the Pascal VOC dataset. Since I have a smaller GPU (GTX 960, 4GB), I reduced the batch size for training. The training loss started at 14 and went down to 7 after 7k iterations, but since then it has not decreased any further. Is this because I changed the batch size?

Jay Mahadeokar · Answer 1 · Tue Aug 09 2016 23:32:41 GMT+0800 (China Standard Time)

What is your batch size? I get the best results with a batch size of 32 (8 per GPU * 4 GPUs in parallel). I also found that a batch size as low as 14 converges, though the results are not the best. I would also look at the running average of the training loss to see what's happening; see the training plot here.

Kristell · Answer 2 · Wed Aug 10 2016 19:44:30 GMT+0800 (China Standard Time)

Thank you for the leads. My batch size was only 2 (that was the most I could squeeze into my GPU memory). Is it OK if I increase the effective batch size by modifying the iter_size parameter in solver.prototxt? I usually use this trick in py-faster-rcnn.
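For readers unfamiliar with the iter_size trick: as far as I understand it, Caffe runs iter_size forward/backward passes, accumulates the gradients, and normalizes them before a single weight update, so the effective batch size is batch_size * iter_size. The sketch below is not Caffe code. It is a toy linear model in plain Python, with hypothetical helper names (mse_grad, accumulated_grad, effective_batch_size), that only illustrates why averaging the gradients of equally sized mini-batches gives the same gradient as one large batch.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, batch_size, iter_size):
    """Average the gradients of iter_size consecutive mini-batches.

    Assumption: this mirrors accumulate-then-normalize behaviour; it is an
    illustration, not the solver's actual implementation.
    """
    total = 0.0
    for i in range(iter_size):
        lo = i * batch_size
        total += mse_grad(w, xs[lo:lo + batch_size], ys[lo:lo + batch_size])
    return total / iter_size


def effective_batch_size(batch_size, iter_size):
    """Number of samples contributing to each weight update."""
    return batch_size * iter_size


# Toy data: 32 samples of y = 3x + 1, evaluated at an arbitrary weight.
xs = [float(i) for i in range(1, 33)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)                                   # one batch of 32
accum = accumulated_grad(w, xs, ys, batch_size=2, iter_size=16)

print(effective_batch_size(2, 16))   # 32
print(math.isclose(full, accum))
```

One caveat with this equivalence: it holds for the gradient of the loss only. Layers that compute statistics over the current batch, such as batch normalization, would still see just 2 images per pass, so iter_size may not fully reproduce the behaviour of a true batch of 32.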

Jay Mahadeokar · Answer 3 · Thu Aug 11 2016 01:42:50 GMT+0800 (China Standard Time)

@kristellmarisse I have not tried that setting. Maybe you could try a smaller network so that a bigger batch size fits? See the few other ResNet models shared here, pretrained on ImageNet, which give decent top-1 accuracy. The # params field in the comparison tables indicates the model size.
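As a rough illustration of how a # params figure translates into size, the sketch below converts a parameter count into the memory needed to store the weights as 32-bit floats. The helper name weight_memory_mb is hypothetical, and the parameter counts are approximate, commonly quoted figures for ImageNet ResNets rather than values taken from the tables linked above.

```python
def weight_memory_mb(num_params, bytes_per_param=4):
    """Memory needed to store the weights alone, in MiB (float32 by default)."""
    return num_params * bytes_per_param / (1024 ** 2)


# Approximate parameter counts (assumption, not from the linked tables).
approx_params = {
    "ResNet-18": 11.7e6,
    "ResNet-50": 25.6e6,
}

for name, count in approx_params.items():
    print(f"{name}: about {weight_memory_mb(count):.0f} MiB of float32 weights")
```

This covers only the weights. During training, gradients and especially the intermediate activations, which grow with batch size and input resolution, usually take up much more GPU memory, so the number is best read as a relative indicator when comparing networks.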

Kristell · Answer 4 · Thu Aug 11 2016 14:21:32 GMT+0800 (China Standard Time)

Thank you for sharing more models.

Kristell · Answer 5 · Thu Aug 11 2016 15:02:16 GMT+0800 (China Standard Time)

By the way, could you share the specs of the GPU on which you trained the ResNet+SSD?

Jay Mahadeokar · Answer 6 · Fri Aug 12 2016 01:23:13 GMT+0800 (China Standard Time)

I think it's a K80; it has 11GB of memory.