Single shot detection implementation

Objectives

This project is a learning exercise. The main goal is to reproduce the results of SSD300 on the Pascal VOC 2007 + 2012 datasets as published in this paper.

Illustrations

Results on VOC2007 test set (mAP)

Backbone	Input Size	mAP	Model Size	Download
VGG16 by lufficc	300	77.7	101 MB	model
VGG16 by me	300	60.9	101 MB	none
VGG16 2nd attempt	300	77.2	101 MB
Mobilenet V2	320	70.5	21.9 MB	model
EfficientNet-B3	300	78.3	47.7 MB	model

Samples

Dependencies

Python 3
PyTorch 1.6.0
OpenCV 4.3.0
albumentations 0.4.6

Notes

Many thanks to sgrvinod for a very detailed tutorial.
Training ran in the Google Colab runtime, so many thanks to Google for the generous offer of free GPUs.
Data:
Training set: VOC07+12 trainval set
Test set: VOC07 test set. In practice, I use this test set as my validation set and the rest for training.
Annotation: all .xml annotation files were parsed into one big JSON file.
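The annotation step in the Data notes can be sketched roughly as follows. This is a minimal illustration and not the code used in this repository: the function name `parse_voc_xml`, the record layout, and the inline `SAMPLE` annotation are assumptions made for the example, and only the standard Pascal VOC tags (`filename`, `object`, `name`, `difficult`, `bndbox`) are relied on.

```python
import json
import xml.etree.ElementTree as ET


def parse_voc_xml(xml_text):
    """Collect boxes, labels and 'difficult' flags from one VOC annotation (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    record = {"filename": root.findtext("filename"),
              "boxes": [], "labels": [], "difficulties": []}
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # VOC stores corner coordinates as text; convert them to integers.
        record["boxes"].append(
            [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        )
        record["labels"].append(obj.findtext("name").strip())
        record["difficulties"].append(int(obj.findtext("difficult", "0")))
    return record


# A made-up annotation used only to keep the example self-contained.
SAMPLE = """
<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>dog</name>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
""".strip()

# In practice one record per .xml file would be appended before dumping.
records = [parse_voc_xml(SAMPLE)]
merged = json.dumps(records)
```

With real data, the list would be built by looping over the annotation files and the final string written to a single file with `json.dump`.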
Data augmentation: I mainly use albumentations and OpenCV. There are a few differences from the paper: the zoom-out operation shrinks the image by a factor of at most 2.5, and any object whose bounding-box area is smaller than 150 px is discarded. I believe that objects at such a low resolution are very difficult for the model to learn and detect.
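The box bookkeeping behind the zoom-out note might look like the sketch below. It is an illustration only, not this repository's augmentation code (which uses albumentations and OpenCV): the function name `zoom_out_boxes` is hypothetical, only the box coordinates are handled (not the pixels), and it assumes the 150 px threshold is an area in square pixels measured after the canvas is resized to the 300x300 network input.

```python
import random

MAX_ZOOM_OUT = 2.5   # maximum shrink factor, from the note above
MIN_BOX_AREA = 150   # from the note above, read here as square pixels (assumption)


def zoom_out_boxes(boxes, img_w, img_h, out_size=300, rng=random):
    """Place the image on a larger canvas, resize the canvas to out_size,
    and return only the boxes that are still large enough (hypothetical helper)."""
    ratio = rng.uniform(1.0, MAX_ZOOM_OUT)
    canvas_w, canvas_h = img_w * ratio, img_h * ratio
    # Random position of the original image inside the canvas.
    left = rng.uniform(0, canvas_w - img_w)
    top = rng.uniform(0, canvas_h - img_h)
    # Scale factors that map canvas coordinates to the network input size.
    sx, sy = out_size / canvas_w, out_size / canvas_h
    kept = []
    for xmin, ymin, xmax, ymax in boxes:
        new_box = [(xmin + left) * sx, (ymin + top) * sy,
                   (xmax + left) * sx, (ymax + top) * sy]
        area = (new_box[2] - new_box[0]) * (new_box[3] - new_box[1])
        if area >= MIN_BOX_AREA:
            kept.append(new_box)
    return kept


# A 100x100 box always survives; a 10x10 box is always below the threshold.
example = zoom_out_boxes([[0, 0, 100, 100], [0, 0, 10, 10]], 300, 300,
                         rng=random.Random(0))
```

The labels belonging to dropped boxes would need to be filtered with the same mask in a full pipeline.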
My implementation of the original SSD300 performs quite poorly, I believe for several reasons: I didn't apply the custom L2Norm to the conv4_3 layer as in the paper; my modeling code was messy, not well structured, and probably broken somewhere I didn't notice; I didn't use pretrained weights for faster and more reliable learning; and I probably ran into vanishing or exploding gradients, which are quite likely with VGG.
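For reference, the L2Norm step mentioned above can be sketched in PyTorch as follows. This is a minimal version written for illustration, not code from this repository; it assumes the usual SSD300 setup where conv4_3 outputs 512 channels on a 38x38 grid and the learnable scale starts at 20, as described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2Norm(nn.Module):
    """Channel-wise L2 normalisation with a learnable per-channel scale."""

    def __init__(self, n_channels=512, init_scale=20.0):
        super().__init__()
        # One scale per channel, broadcast over batch and spatial dimensions.
        self.scale = nn.Parameter(torch.full((1, n_channels, 1, 1), init_scale))

    def forward(self, x):
        # Normalise the feature vector at each spatial location to unit
        # L2 norm across channels, then rescale it.
        return self.scale * F.normalize(x, p=2, dim=1, eps=1e-10)


# Random tensor standing in for the conv4_3 feature map.
features = torch.randn(2, 512, 38, 38)
out = L2Norm()(features)
```

The layer is applied only to the conv4_3 feature map, whose activations have a larger magnitude than those of the deeper layers feeding the other prediction heads.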
I tried mixed precision at the O1 level on Google Colab's Tesla T4 GPU, but saw no significant improvement in training speed.
To speed up training, I tried lightweight backbones, MobileNetV2 and EfficientNet-B3, but training time didn't improve much. It turned out that data loading was the bottleneck: there wasn't enough CPU power to feed data to the GPU, since Colab only provides a CPU with 1 or 2 cores. So I had to optimize my augmentation and dataloader code to tackle it.
In the SSD model with EfficientNet-B3 as the backbone, I used pretrained weights from lufficc's SSD implementation and added some layers that replicate the idea of Feature Pyramid Networks, which probably improved the net's performance on smaller objects. The performance improved from 73.9 mAP (lufficc's implementation) to 78.3 mAP.
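The Feature Pyramid Network idea referred to in the last note is, in its generic form, a top-down pathway with lateral connections. The sketch below shows that generic pattern only and should not be read as the layers actually used in this repository: the class name `TopDownMerge`, the channel counts (64, 128, 256), and the feature-map sizes are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDownMerge(nn.Module):
    """Generic FPN-style merge: 1x1 lateral convs, upsample-and-add, 3x3 smoothing."""

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels]
        )

    def forward(self, feats):
        # feats are ordered from the highest resolution to the lowest.
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        # Walk from the coarsest map upwards, adding upsampled context.
        for i in range(len(laterals) - 1, 0, -1):
            up = F.interpolate(laterals[i], size=laterals[i - 1].shape[-2:],
                               mode="nearest")
            laterals[i - 1] = laterals[i - 1] + up
        return [conv(x) for conv, x in zip(self.smooth, laterals)]


# Hypothetical backbone outputs at three scales.
feats = [torch.randn(1, 64, 38, 38),
         torch.randn(1, 128, 19, 19),
         torch.randn(1, 256, 10, 10)]
merged_feats = TopDownMerge([64, 128, 256])(feats)
```

Because the high-resolution maps receive context from the coarser ones, this kind of merge is commonly credited with helping small-object detection, which matches the effect described in the note.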

To do

Replace the backbone with something else, like ResNet-50, ResNet-101, DenseNet-201, or SE-ResNeXt-101
Try backbones like MobileNetV2 or EfficientNet for faster training :))
Feature Pyramid Networks (FPN)
Focal loss
Experiment with more augmentations (CutMix)

Reference

a-PyTorch-Tutorial-to-Object-Detection - sgrvinod
SSD - High quality, fast, modular reference implementation of SSD in PyTorch - lufficc
pytorch-retinanet - kuangliu
EfficientNet-PyTorch - A PyTorch implementation of EfficientNet - lukemelas

About

A lousy reimplementation of single shot detection in PyTorch for learning purposes

https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection

MIT License
