Human Portrait Segmentation

I created this project to perform human portrait segmentation using a U-Net architecture with MobileNetV2 as the encoder. The pretrained model reaches 96.1% mean IoU (mIoU) on the test set. The dataset contains no animals, so the model was not intended to segment animal portraits. However, the model appears to have learned cues such as background blur, contrast differences, and pixel value differences, so it can work on animal portraits as well.

The pretrained model : The Pretrained Model
To test the pretrained model : python test.py -i 'path_of_image.png' -m 'pretrained_model_path.hdf5'
To run in real time on a webcam : python webcam.py -m 'pretrained_model_path.hdf5'

Some Test Results

Dataset and Preprocessing

The dataset link : The Dataset
The original dataset contains 1597 images and masks. The original dataset : The Dataset Original
The dataset provided here was created by randomly cropping each image 5 times. Each cropped image is 512x512x3.
The training set contains 6985 images and corresponding masks, and the validation set has 500 images and masks.
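The random cropping described above can be sketched as follows. This is only an illustration, not the script used to build the dataset: the function name `random_crops`, its parameters, and the use of NumPy are assumptions, and it assumes every source image is at least 512 pixels on each side. The important detail is that the same window is applied to the image and its mask so they stay aligned.

```python
import numpy as np

def random_crops(image, mask, crop_size=512, n_crops=5, rng=None):
    """Return n_crops aligned (image, mask) crops of size crop_size x crop_size.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    if height < crop_size or width < crop_size:
        raise ValueError("image is smaller than the crop size")

    pairs = []
    for _ in range(n_crops):
        # Pick the top-left corner so the window fits inside the image.
        top = int(rng.integers(0, height - crop_size + 1))
        left = int(rng.integers(0, width - crop_size + 1))
        window = (slice(top, top + crop_size), slice(left, left + crop_size))
        # The same window is used for the image and the mask.
        pairs.append((image[window], mask[window]))
    return pairs
```

Passing a seeded generator (for example `np.random.default_rng(0)`) makes the crops reproducible.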

Data Augmentation

Brightness Augmentation

To make the model work well at different brightness levels, I used brightness augmentation. Each image is converted from the RGB color space to the HSV color space, and its “value” (brightness) channel is randomized to create images with different brightness levels. The image is then converted back from HSV to RGB so that it can be fed to the deep learning model.

Image Quality Augmentation

Since the aim of the project was a real-time semantic segmentation model, the model had to generalize to different cameras and image qualities. I used the OpenCV library's IMWRITE_JPEG_QUALITY flag to re-encode each image as a JPEG with a randomized quality value, which reduces the quality of the image.

Model

I used the U-Net architecture because it is easy to implement, achieves high accuracy, and is fast enough to run in real time on GPUs. U-Net is widely used in biomedical image segmentation. The architecture consists of two main parts: the encoder, which extracts high-level features from the image, and the decoder, which decodes those features back into a mask. U-Net also has skip connections between encoder layers and decoder layers, so that the decoder can reuse the encoder's features. Because I wanted a model that could run in real time, I used MobileNetV2, which consists of convolution layers and bottleneck blocks, as the encoder. For the decoder, I used transposed convolution and upsampling layers, which act as rough inverses of the max pooling and convolution operations.
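The encoder and decoder layout described above could look roughly like the following Keras sketch. It is not the project's actual model definition. The layer names are the ones exposed by `tf.keras.applications.MobileNetV2`; the function name `build_portrait_unet`, the decoder filter counts, the kernel sizes, and the single sigmoid output channel are assumptions. `weights=None` is used only so the sketch builds without downloading pretrained weights.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_portrait_unet(input_shape=(512, 512, 3)):
    """U-Net style model with a MobileNetV2 encoder (illustrative sketch)."""
    encoder = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None
    )
    # Encoder activations at 1/2, 1/4, 1/8 and 1/16 of the input resolution.
    skip_names = [
        "block_1_expand_relu",
        "block_3_expand_relu",
        "block_6_expand_relu",
        "block_13_expand_relu",
    ]
    skips = [encoder.get_layer(name).output for name in skip_names]
    # Bottleneck features at 1/32 of the input resolution.
    x = encoder.get_layer("block_16_project").output

    # Decoder: upsample by 2, then merge with the matching encoder feature map.
    for skip, filters in zip(reversed(skips), (256, 128, 64, 32)):  # assumed widths
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same",
                                   activation="relu")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Final upsampling from 1/2 resolution back to the input size.
    x = layers.UpSampling2D(size=2)(x)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel portrait probability
    return tf.keras.Model(inputs=encoder.input, outputs=outputs)
```

The input height and width should be multiples of 32 so that every upsampled feature map matches the size of the encoder feature map it is concatenated with.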

Loss

For semantic segmentation tasks, one of the most important metrics is intersection over union (IoU): the ratio of the area where the target mask and the output mask intersect to the area of their union. Dice loss, which is based on the closely related Dice coefficient, is therefore widely used in the literature to optimize the overlap between the output mask and the target mask. However, I wanted the loss to reflect not only the overlap but also the similarity between the pixel values of the output and the mask, which binary cross entropy loss captures since the dataset has two classes. Therefore, I used a combination of the two losses: “Dice loss + Binary Cross Entropy Loss”. I trained the network both with dice loss alone and with the combination of the two losses. I observed that the combined loss gave higher accuracy and more meaningful results in terms of localizing the object.
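The combined loss and the IoU metric can be written out as below. This is a framework-independent NumPy sketch for illustration; training code would use the tensor operations of the deep learning framework instead. The function names, the smoothing constant of 1.0, the equal weighting of the two terms, and the 0.5 threshold are assumptions.

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """1 - Dice coefficient, computed on soft (probability) masks."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return float(1.0 - dice)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean per-pixel binary cross entropy."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    # Clip predictions so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64).ravel(), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

def bce_dice_loss(y_true, y_pred):
    """Sum of the two losses, weighted equally (an assumption)."""
    return binary_cross_entropy(y_true, y_pred) + dice_loss(y_true, y_pred)

def iou(y_true, y_pred, threshold=0.5):
    """Intersection over union of the binarized masks."""
    true = np.asarray(y_true) >= 0.5
    pred = np.asarray(y_pred) >= threshold
    union = np.logical_or(true, pred).sum()
    if union == 0:
        return 1.0  # both masks are empty
    return float(np.logical_and(true, pred).sum() / union)
```

The Dice term rewards overlap over the whole mask, while the cross entropy term penalizes each pixel on its own, which is why the two complement each other.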

Some Experiments and Observations

I experimented with different numbers of output layers: I created 5 output layers by resizing images to the sizes of intermediate layers. I observed that training with multiple losses and backpropagation did not converge, since there are multiple local objectives. However, I assume that this would not be a problem for a bigger model. Adding more skip connections helps to increase the IoU on the test set.

ypw1996 / portrait_segmentation