YOLOv1

PyTorch implementation of the YOLOv1 architecture presented in "You Only Look Once: Unified, Real-Time Object Detection" [1] by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.

Methods

For convenience, PyTorch's pretrained ResNet50 was used as the backbone instead of Darknet; the detection layers at the end of the model, however, follow those described in the paper exactly. The training data was augmented by randomly scaling dimensions, shifting position, and adjusting hue/saturation, each by up to 20% of its original value.
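Loosely, the setup looks like the sketch below. This is not the repository's actual code: the class name, the `AdaptiveAvgPool2d`, and the head layer sizes are illustrative assumptions; only the backbone swap and the S x S x (B*5 + C) output follow the description above.

```python
import torch
import torch.nn as nn
import torchvision

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

class YOLOv1ResNet(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Everything up to (but not including) the average pool and fc head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Detection layers in the spirit of the paper's final layers.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d((S, S)),
            nn.Flatten(),
            nn.Linear(2048 * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        return self.head(self.backbone(x)).view(-1, S, S, B * 5 + C)

# A 448x448 input yields the paper's 7x7x30 prediction tensor:
# YOLOv1ResNet()(torch.randn(1, 3, 448, 448)).shape  # (1, 7, 7, 30)
```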

Results

Overall, the bounding boxes look convincing, though it is interesting to note that YOLOv1 has trouble detecting tightly grouped objects as well as small, distant ones. Additionally, the fully vectorized SumSquaredLoss function achieves roughly a 4x speedup in training time compared to using a for-loop to determine bounding box responsibility.
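As a rough illustration of the vectorization idea (not the repository's actual SumSquaredLoss), the "responsible" predictor in every cell can be selected in one shot with broadcasting and `torch.max` instead of a per-cell loop. The `compute_iou` helper here is assumed to broadcast over the box dimension and return element-wise IoUs.

```python
import torch

def responsible_box_mask(pred_boxes, target_boxes, compute_iou):
    """Pick the 'responsible' predictor per cell without a Python loop.

    pred_boxes:   (N, S, S, B, 4) predicted boxes
    target_boxes: (N, S, S, 4)    one ground-truth box per cell
    compute_iou:  broadcasting IoU helper returning (N, S, S, B)
    """
    ious = compute_iou(pred_boxes, target_boxes.unsqueeze(3))  # (N, S, S, B)
    best_iou, best_idx = ious.max(dim=-1)                      # (N, S, S)
    # One-hot mask over the B predictors marking the responsible one.
    mask = torch.zeros_like(ious)
    mask.scatter_(-1, best_idx.unsqueeze(-1), 1.0)
    return mask, best_iou
```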

Notes

Intersection Over Union (IOU)

The area of intersection between the ground-truth and predicted bounding boxes, divided by the area of their union. The confidence score for each grid cell is Pr(object in cell) * IOU(truth, pred). If the IOU is greater than some threshold t, the prediction is a True Positive (TP); otherwise it is a False Positive (FP).
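A minimal IoU helper in that spirit might look like the following; it assumes corner-format (x1, y1, x2, y2) boxes, which is an illustrative choice rather than the repository's convention.

```python
import torch

def iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    # Corners of the intersection rectangle.
    x1 = torch.max(box_a[..., 0], box_b[..., 0])
    y1 = torch.max(box_a[..., 1], box_b[..., 1])
    x2 = torch.min(box_a[..., 2], box_b[..., 2])
    y2 = torch.min(box_a[..., 3], box_b[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_a = (box_a[..., 2] - box_a[..., 0]) * (box_a[..., 3] - box_a[..., 1])
    area_b = (box_b[..., 2] - box_b[..., 0]) * (box_b[..., 3] - box_b[..., 1])
    # Small epsilon guards against division by zero for degenerate boxes.
    return inter / (area_a + area_b - inter + 1e-6)
```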

There is a great visual illustration of this in [2].

Precision

Measures how much you can trust a positive prediction from the model: out of all positive predictions, the proportion that are correct, i.e. Precision = TP / (TP + FP).

Recall

Measures how well the model does at finding all the positives: Recall = TP / (TP + FN), where TP counts the objects correctly found and FN counts the objects that were present but missed. Together, TP + FN equals the total number of labeled objects.
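As a toy illustration, assuming TP/FP/FN counts have already been obtained by thresholding IoU as described above:

```python
def precision(tp: int, fp: int) -> float:
    # Of all positive predictions, the fraction that were correct.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all labeled objects, the fraction the model actually found.
    return tp / (tp + fn) if (tp + fn) else 0.0
```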

References

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640v5 [cs.CV] 9 May 2016

[2] Towards Data Science. mAP (mean Average Precision) might confuse you!
