Face-Detection

This app was developed by Shubham Seth and Yash Sanjeev. It is a basic face detector that is useful for comparing different object detection models. Before you start playing around with the app, we would like to give you some background on where the main object detection algorithms came from. There are many places where we could begin our tale, but we find it appropriate to begin in 2014, with the development of RCNN.

RCNN - 2014

RCNN, short for Region-based Convolutional Neural Network, was in our opinion the founding block of modern object detection algorithms. The idea was that, instead of running the CNN over every window in the image one by one and identifying whether it contains a foreground object or background, we run the classifier only over regions with a higher probability of containing an object. This was achieved using selective search, an algorithm that detects blob-like structures in images. You can read more about it here.

Once selective search has returned around 2000 region proposals, a CNN is run over each of them individually to extract features. An SVM then classifies each region as one of the object classes or as background. Finally, a linear regression is run on the bounding boxes in order to obtain a tighter box.
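To make the bounding box regression step concrete, here is a minimal sketch of how predicted offsets can be applied to a proposal. It assumes boxes are stored as centre, width and height, with the shift scaled by the box size and the resize done in log space; the function name and the example numbers are ours, not taken from this app.

```python
import math

def apply_box_deltas(proposal, deltas):
    """Refine a proposal box with predicted regression offsets.

    proposal: (cx, cy, w, h) centre coordinates, width and height.
    deltas:   (tx, ty, tw, th) offsets predicted by the regressor.
    """
    cx, cy, w, h = proposal
    tx, ty, tw, th = deltas
    new_cx = cx + w * tx          # shift the centre, scaled by the box width
    new_cy = cy + h * ty          # shift the centre, scaled by the box height
    new_w = w * math.exp(tw)      # resize in log space, so the size stays positive
    new_h = h * math.exp(th)
    return (new_cx, new_cy, new_w, new_h)

# Illustrative values only: nudge the box right and up, and halve its height.
refined = apply_box_deltas((50.0, 50.0, 20.0, 40.0), (0.1, -0.05, 0.0, math.log(0.5)))
print(refined)
```

With these made-up offsets the centre moves to roughly (52, 48) and the height shrinks from 40 to about 20, while the width stays the same.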

Thus the pipeline looked something like this:

Pass the image through the selective search algorithm. This results in approximately 2000 region proposals.
Take each region proposal and pass it through a CNN. Extract the features from this network's final fully connected layer.
Pass each feature vector through an SVM and classify whether it's background or a foreground object of a certain type.
If an object is detected, pass the features through a linear regressor to obtain a tighter bounding box around the object.
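The four steps above can be sketched as a single loop. This is only a skeleton: each stage is passed in as a function, and the toy stand-ins below (a fixed proposal list, a mean-pixel "feature", a threshold "classifier") are hypothetical placeholders, not the real selective search, AlexNet or SVM.

```python
def rcnn_detect(image, propose_regions, extract_features, classify, refine_box):
    """Run the four RCNN stages; every stage is supplied as a callable."""
    detections = []
    for box in propose_regions(image):            # step 1: region proposals
        features = extract_features(image, box)   # step 2: one CNN pass PER region
        label, score = classify(features)         # step 3: class or "background"
        if label != "background":
            refined = refine_box(box, features)   # step 4: tighten the box
            detections.append((label, score, refined))
    return detections

# Toy stand-ins so that the sketch runs; none of these are real models.
def toy_proposals(image):
    return [(0, 0, 2, 2), (1, 1, 3, 3)]           # boxes as (x1, y1, x2, y2)

def toy_features(image, box):
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return [sum(pixels) / len(pixels)]            # mean intensity as a 1-D "feature"

def toy_classify(features):
    score = features[0]
    return ("face", score) if score > 0.5 else ("background", 1 - score)

def toy_refine(box, features):
    return box                                    # identity; a real regressor adjusts the box

image = [[0, 0, 0],
         [0, 1, 1],
         [0, 1, 1]]
result = rcnn_detect(image, toy_proposals, toy_features, toy_classify, toy_refine)
print(result)  # [('face', 1.0, (1, 1, 3, 3))]
```

Note that the feature extractor is called inside the loop, once per proposal. With around 2000 proposals per image, that is where most of the running time goes.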

However, the RCNN model had a few glaring problems:

It was pretty slow, as it had to run a CNN (AlexNet) over every region proposal generated by the selective search algorithm.
It had to train three different models separately: the CNN that generates image features, the classifier that predicts the class, and the regression model that tightens the bounding boxes. This made the pipeline extremely hard to train.

Fast RCNN - 2015

Fast RCNN was based on a couple of genius insights by Ross Girshick, the first author of the original RCNN paper. His first insight was to pass the entire image through the CNN once and share this computation across all the regions generated by selective search. This works because each image region generated by selective search can be mapped to a corresponding region of the CNN feature map. That region is then passed through an RoI pooling layer, which converts the CNN features of every region proposal to the same size, usually using max pooling. Once they are all the same size, it's easy to feed them through the remaining layers and generate the outputs together. Thus, instead of around 2000 passes through the CNN, all it took was a single pass!
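Here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map. It assumes the region is already expressed in feature-map coordinates and splits it into bins with floor and ceiling edges (so neighbouring bins may overlap by a cell); that bin rule is one common choice rather than the only one, and the function name is ours.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2-D feature map to out_h x out_w.

    x2 and y2 are exclusive, and the region must be at least one cell wide and tall.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Bin edges: floor for the start, ceiling for the end, at least one cell.
        ys = y1 + (i * h) // out_h
        ye = max(ys + 1, y1 + ((i + 1) * h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * w) // out_w
            xe = max(xs + 1, x1 + ((j + 1) * w + out_w - 1) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
# Two regions of different sizes both come out as 2 x 2.
print(roi_max_pool(feature_map, (0, 0, 6, 4), 2, 2))  # [[9, 12], [21, 24]]
print(roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2))  # [[15, 16], [21, 22]]
```

The point of the example is the last two lines: a 6 x 4 region and a 3 x 3 region both end up as 2 x 2 outputs, so the layers that follow can treat every proposal the same way.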

His second insight was that, instead of training three different models as he did in RCNN, it would be more practical and easier to train one model, combining the CNN, the classifier (a softmax layer in place of the SVM) and the bounding box regressor into one network, thus making the training process far less tedious.

However, despite being significantly faster than its ancestor, it still had a bottleneck at run time: the selective search algorithm. It took time to generate 2000 region proposals, and this slowed down the entire process.

Faster RCNN - 2016

In this 2016 paper, the unsupervised method of generating Regions of Interest (selective search) was thrown out of the window and replaced by supervised learning.

How exactly was this achieved? A small Fully Convolutional Network, the Region Proposal Network, is slid over the feature map. At each window position it considers a fixed set of reference boxes of different scales and aspect ratios, called anchors, and for each anchor it outputs a score for whether the anchor contains an object along with the coordinates of the bounding box. Each proposed region that is likely to be an object is then passed into Fast R-CNN to generate a classification and a tightened bounding box.
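As a rough illustration of what the anchors look like, the sketch below places one box per scale and aspect ratio on every feature-map cell. The stride of 16 and the three scales and three ratios follow the values reported in the Faster R-CNN paper, but the function itself and the choice to centre anchors in the middle of each cell are our own assumptions.

```python
import math

def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2) in image pixels.

    One anchor per (scale, ratio) pair is centred on every feature-map cell.
    ratio is height / width; each anchor has an area of scale * scale.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride   # cell centre in image coordinates
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(2, 3, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
print(len(anchors))   # 54 = 2 * 3 cells * 9 anchors per cell
print(anchors[1])     # (-56.0, -56.0, 72.0, 72.0), the square 128-pixel anchor on the first cell
```

Anchors near the border can extend outside the image, as the second printed line shows. The network then scores every anchor and predicts offsets that adjust it into a proposal.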

This creative approach makes it possible to generate the region proposals at almost no extra cost, because they reuse the feature map that is computed for detection anyway. We only wish the authors had been as creative with the name of the model.

Mask RCNN - 2017

While the above algorithms generate rectangular bounding boxes, this model takes it a step further by generating a pixel-level segmentation of each object. How does it do it?

Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask saying whether or not a given pixel is part of an object. The branch is just a Fully Convolutional Network on top of a CNN-based feature map. It takes the CNN feature map as input and generates a binary mask as the output.

Mask RCNN did also change one small thing, however. Instead of RoI pool it uses RoI align, a more accurate version of the former. RoI pool rounds the region's coordinates to the feature map grid, which was tolerable when generating bounding boxes but causes misalignment when generating a pixel-level segmentation. RoI align skips the rounding and instead uses bilinear interpolation to compute the feature values at the exact locations.
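The difference is easiest to see in code. The sketch below is a simplified RoI align on a single-channel feature map: it keeps the region's coordinates as floats and reads one bilinearly interpolated value at the centre of each output bin. The one-sample-per-bin choice, the convention that cell values sit at integer coordinates, and the function names are our simplifications for illustration.

```python
import math

def bilinear_sample(feature_map, y, x):
    """Interpolate a 2-D feature map at a fractional (y, x) location."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1.0)                 # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature_map, roi, out_h, out_w):
    """Sample the centre of each output bin of roi = (x1, y1, x2, y2).

    Coordinates are floats on the feature-map grid; nothing is rounded.
    """
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    return [[bilinear_sample(feature_map,
                             y1 + (i + 0.5) * bin_h,
                             x1 + (j + 0.5) * bin_w)
             for j in range(out_w)]
            for i in range(out_h)]

feature_map = [[0, 1, 2],
               [3, 4, 5],
               [6, 7, 8]]
# A region with fractional corners, which RoI pool would have to round.
aligned = roi_align(feature_map, (0.5, 0.5, 1.5, 1.5), 2, 2)
print(aligned)  # [[3.0, 3.5], [4.5, 5.0]]
```

The outputs fall between the values stored in the grid, which is exactly the sub-cell precision that a pixel-level mask needs.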

YOLO v3 - 2018

YOLO takes a completely different approach to object detection. It first splits the image into a grid of 13 x 13 cells, and a single convolutional network processes the whole image and produces predictions for every cell. Each cell predicts 5 bounding boxes, and for each bounding box YOLO generates a confidence score and a class. (The 13 x 13 grid with 5 boxes per cell matches YOLO v2; YOLO v3 predicts 3 boxes per cell at each of three grid scales.) The results are obtained after removing low-confidence outputs and overlapping boxes. The entire pass happens in one go, and since it's just a single convolutional network, it's pretty fast. In fact, it is used for real-time detection!
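The final clean-up step, removing low-confidence outputs and overlapping boxes, is usually done with a confidence threshold followed by non-maximum suppression. Below is a minimal single-class sketch; the two thresholds of 0.5 and the sample detections are made-up values for illustration, and a real detector would typically repeat this per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(detections, score_threshold=0.5, iou_threshold=0.5):
    """detections: list of (box, score). Drop weak boxes, then suppress overlaps."""
    strong = [d for d in detections if d[1] >= score_threshold]
    strong.sort(key=lambda d: d[1], reverse=True)   # best box first
    kept = []
    for box, score in strong:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, other) <= iou_threshold for other, _ in kept):
            kept.append((box, score))
    return kept

detections = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.8),       # heavily overlaps the first box
    ((100, 100, 140, 140), 0.7),
    ((200, 200, 220, 220), 0.2),   # low confidence
]
print(filter_detections(detections))
# [((10, 10, 50, 50), 0.9), ((100, 100, 140, 140), 0.7)]
```

The second box is dropped because it overlaps a higher-scoring box, and the fourth is dropped because its confidence is below the threshold.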
