NileshArnaiya / puzzle-image-segmentation


Jigsaw Puzzle Piece Image Segmentation & Placement Prediction

Motivation

Despite having only rudimentary exposure to image classification and no exposure to semantic/instance segmentation, I found myself gravitating towards instance segmentation. Inspired by this writeup about using Matterport's Mask R-CNN for pixel-level balloon identification, I started mulling ideas with my good friend and founder of the codebug bootcamp. A couple of rabbit holes later, we stumbled upon a couple of puzzles under the coffee table and came up with an initial business question:

Can you take a photo of a puzzle piece and the photo of its box and predict where in the puzzle it belongs?

As an avid puzzler growing up, I thought this would be a fun challenge that had several checkpoints (and stretch goals) that allowed me to gauge the feasibility of the task along the way and adjust as needed.

Project Organization

Dataset Creation

The dataset was created by taking pieces from 5 puzzles and photographing them in expected situations (in the puzzle's box, in one's hand, on a table, etc.). Given the business application of the desired solution, it did not make sense to photograph these pieces in random situations. For the training and validation sets, the neural network requires the object outlines to be annotated and classified, so I used the VGG Image Annotator (VIA) to create these in JSON.

Annotation Example


  • 5 puzzles (2x 100-piece, 2x 200-piece, 1x 1000-piece)
  • 93 annotated training images
  • 22 annotated validation images
  • An ever-increasing number of test images
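
The annotations exported from VIA live in a single JSON file that maps each image to its polygon regions. A minimal sketch of reading those outlines back out, assuming the standard VIA 2.x export format (the file name here is a placeholder):

```python
import json

# Load a VIA export and print the polygon outline of each annotated image.
# "via_region_data.json" is a placeholder file name.
with open("via_region_data.json") as f:
    annotations = json.load(f)

for entry in annotations.values():
    regions = entry["regions"]
    if isinstance(regions, dict):           # older VIA exports use a dict, newer a list
        regions = regions.values()
    for region in regions:
        shape = region["shape_attributes"]  # polygon stored as parallel x/y lists
        outline = list(zip(shape["all_points_x"], shape["all_points_y"]))
        print(entry["filename"], len(outline), "vertices")
```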

The goal of predicting the location of a puzzle piece was broken down into four parts:

Part I: Instance Segmentation

In computer vision, image identification can be broken down into at least four tiers:

  1. Classification: Identifying if there is, or is not, a puzzle piece in the image.
  2. Semantic Segmentation: Identifying all the pixels of puzzle piece(s) in the image.
  3. Object Detection: Detecting and locating each piece in an image with a bounding box (accounting for overlaps).
  4. Instance Segmentation: Quantifying and locating all instances of a puzzle piece, at the pixel level, in the image.

The first part of the project focused on instance segmentation: accurately classifying and locating a puzzle piece in an image. To do this, I started from a Mask R-CNN pretrained on the COCO dataset and fine-tuned it on the annotated dataset. A handful of models were trained (freezing the base layers) with different configuration parameters and for varying numbers of epochs, and the models were evaluated on their val_mrcnn_mask_loss and Intersection over Union (IoU) scores. Intersection over Union measures the percent overlap between the ground truth (annotated) mask/bounding box and the predicted mask/bounding box. Despite adjusting loss weights in an effort to increase the IoU scores, the average val_mrcnn_mask_loss plateaued at around 0.14.
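
For reference, a minimal sketch of how the mask and box IoU scores can be computed with NumPy (the function names are my own; boxes follow the Mask R-CNN (y1, x1, y2, x2) convention):

```python
import numpy as np

def mask_iou(gt_mask, pred_mask):
    """IoU between two boolean masks of the same shape."""
    intersection = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return intersection / union if union else 0.0

def box_iou(box_a, box_b):
    """IoU between two boxes given as (y1, x1, y2, x2)."""
    y1, x1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y2, x2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```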

Average Mask IoU: 84%
Average Box IoU: 87%
IoU Comparison


In the final model, some of the configuration/design choices included:

  • Pretrained Weights: COCO
  • Epochs: 20
  • DETECTION_MAX_INSTANCES: 1 (expecting only 1 puzzle piece per image)
  • MINI_MASK_SHAPE: (224, 224)
  • USE_MINI_MASK: True
  • DETECTION_MIN_CONFIDENCE: 0.90
  • STEPS_PER_EPOCH: 100
  • VALIDATION_STEPS: 50
  • LOSS_WEIGHTS: {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.05, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.2}
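
Put together, these final-model settings roughly correspond to a Config subclass in the Matterport Mask R-CNN implementation. A sketch, assuming that library's API (the class name and NUM_CLASSES are my own additions):

```python
from mrcnn.config import Config  # matterport/Mask_RCNN

class PuzzleConfig(Config):
    """Sketch of the final training configuration (class name is hypothetical)."""
    NAME = "puzzle"
    NUM_CLASSES = 1 + 1                 # background + puzzle piece (assumed)
    DETECTION_MAX_INSTANCES = 1         # expecting only 1 puzzle piece per image
    DETECTION_MIN_CONFIDENCE = 0.90
    USE_MINI_MASK = True
    MINI_MASK_SHAPE = (224, 224)
    STEPS_PER_EPOCH = 100
    VALIDATION_STEPS = 50
    LOSS_WEIGHTS = {
        "rpn_class_loss": 1.0,
        "rpn_bbox_loss": 1.05,
        "mrcnn_class_loss": 1.0,
        "mrcnn_bbox_loss": 1.0,
        "mrcnn_mask_loss": 1.2,
    }
```

Training with model.train(..., layers='heads') in that library keeps the pretrained backbone frozen and only updates the head layers, which is what "freezing the base layers" refers to above.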

And, in "inferior" models, some of the configuration/design choices included:

  • Epochs: 5, 10, 15
  • USE_MINI_MASK: False (caused RAM overload; the hope was to increase mask accuracy by avoiding mask downsizing)
  • MINI_MASK_SHAPE: (28,28), (56,56) - searching for better mask IoUs
  • LOSS_WEIGHTS: Increase/Decrease of mrcnn_mask_loss (vs others), and mrcnn_bbox_loss/class_loss
  • DETECTION_MAX_INSTANCES: 10 - originally, had annotated images of multiple pieces, but removed to increase single-piece mask IoU

Part II: Segmentation Extraction

After tuning the model and getting accurate (enough) predicted masks, the next step was to create a new image containing only the puzzle piece. Using the bounding box coordinates, only the region of interest (ROI) was cropped from the original input image. With the piece isolated and now filling most of the picture, the next step was to remove the predicted background by setting its pixels to black, adding an alpha channel, and making the background transparent. To do this, I took the predicted mask and applied these changes to every pixel in the ROI image not covered by the object's mask.
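
A minimal sketch of that extraction, assuming the Matterport detection output format (r['rois'] as (y1, x1, y2, x2) boxes and r['masks'] as a boolean (H, W, N) array); variable and file names are illustrative:

```python
import cv2
import numpy as np

# `image` is the original RGB photo, `r` the Mask R-CNN detection result.
y1, x1, y2, x2 = r['rois'][0]
roi = image[y1:y2, x1:x2].copy()          # crop the region of interest
mask = r['masks'][y1:y2, x1:x2, 0]        # predicted mask, cropped to the ROI

roi[~mask] = 0                            # black out the predicted background

# Add an alpha channel and make everything outside the mask transparent.
rgba = cv2.cvtColor(roi, cv2.COLOR_RGB2RGBA)
rgba[..., 3] = np.where(mask, 255, 0).astype(np.uint8)
cv2.imwrite("piece_extracted.png", cv2.cvtColor(rgba, cv2.COLOR_RGBA2BGRA))
```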

Segmentation Extraction Example


Part III: Feature Matching

With the extracted puzzle piece and an image of the puzzle, the next step involved passing both images to a feature matching algorithm. Given the expectation that the pieces were photographed in arbitrary positions, the piece's rotation and tilt with respect to the camera had to be treated as unknowns. As a result, sliding window approaches were a poor fit, and I ultimately settled on the SIFT (Scale Invariant Feature Transform) algorithm to detect features. SIFT can be rather slow on large images (which was the case here), and slower still on more complex puzzle scenes (e.g. an ocean floor with hundreds of animals means more features to sift through).
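
A minimal sketch of the SIFT detection and matching step with OpenCV (file names are placeholders, and the 0.75 ratio-test threshold is a common default rather than a value taken from this project):

```python
import cv2

piece = cv2.imread("piece_extracted.png", cv2.IMREAD_GRAYSCALE)
box = cv2.imread("box.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints/descriptors in both images.
sift = cv2.SIFT_create()
kp_piece, des_piece = sift.detectAndCompute(piece, None)
kp_box, des_box = sift.detectAndCompute(box, None)

# Brute-force kNN matching followed by Lowe's ratio test to keep distinctive matches.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des_piece, des_box, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
```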

Feature Matching Example


Part IV: Location Prediction

If enough features were matched between the piece and the box, and the piece's location could be determined, the next step was to draw the outline of the piece where it belongs in the puzzle. To do this, OpenCV's findHomography and perspectiveTransform were used to find the orientation (scale, rotation, skew, etc.) of the piece in the box, apply this perspective transformation to the vertices of the piece's outline, and draw them on the box.
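
A minimal sketch of that step, continuing from the matching sketch above (kp_piece, kp_box, good); here the projected outline is simply the corners of the piece image, whereas the actual pipeline transforms the piece's contour vertices. The match-count threshold is an assumption:

```python
import cv2
import numpy as np

MIN_MATCH_COUNT = 10  # assumed threshold for "enough" matches

if len(good) >= MIN_MATCH_COUNT:
    src_pts = np.float32([kp_piece[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp_box[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Homography mapping piece coordinates into box coordinates (RANSAC rejects outliers).
    M, _ = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)

    # Project the piece outline onto the box image and draw it.
    h, w = piece.shape[:2]
    corners = np.float32([[0, 0], [0, h - 1], [w - 1, h - 1], [w - 1, 0]]).reshape(-1, 1, 2)
    projected = cv2.perspectiveTransform(corners, M)
    located = cv2.polylines(box.copy(), [np.int32(projected)], True, 255, 3, cv2.LINE_AA)
```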

Location Prediction Example


Links to Dataset, Model Checkpoints, Results, Etc.

Future Work & Takeaways

Areas of Improvement

  • Sometimes, applying the perspective transformation to the puzzle piece's contours distorts the piece's outline into several stray lines across the image. This is almost certainly fixable given a little more time.

  • When feature matching fails to find enough keypoints, consider applying a second feature matching algorithm or applying SIFT again with a new set of parameters.

  • While SIFT is scale invariant, one way to improve the feature matching would be, if possible, to constrain that invariance. A puzzle piece can only be roughly 1/100, 1/200, 1/500, or 1/1000 the size of the box image (depending on the number of pieces in the puzzle). If SIFT could be told that it does not have to consider scales at which the puzzle piece could not exist, the algorithm should, at the very least, run faster. Being aware of the maximum possible scale of the piece, it might also extract more useful features.

  • Again, when feature matching fails to find enough keypoints, consider passing multiple images of the same piece into SIFT to see whether different orientations and scales produce more successful results (see the sketch after this list). Though I had not anticipated needing photos of the same piece at different orientations/scales, this is probably the easiest improvement to implement: it only requires more test images and adjusting the feature matching to loop through a batch of images until a successful match (or the end of the batch).
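
A minimal sketch of that last fallback, assuming a list of alternative photos of the same piece (paths and threshold are hypothetical):

```python
import cv2

def match_any(piece_paths, box_gray, min_matches=10):
    """Try several photos of the same piece until one yields enough good matches."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()
    kp_box, des_box = sift.detectAndCompute(box_gray, None)
    for path in piece_paths:
        piece_gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        kp_piece, des_piece = sift.detectAndCompute(piece_gray, None)
        knn = matcher.knnMatch(des_piece, des_box, k=2)
        good = [m for m, n in knn if m.distance < 0.75 * n.distance]
        if len(good) >= min_matches:
            return path, good, kp_piece, kp_box
    return None  # no orientation/scale produced a confident match
```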

Alternative Image Segmentation Model & Process

While the final model had a respectable Mask IoU of ~84% and Box IoU of ~87%, I believe it could break the 90% threshold by:

  1. Requiring a photo of the backside of the puzzle piece (in addition to the front side and the box)
  2. Training the model on the backs of puzzle pieces instead of the front sides
  3. Passing the backside of the puzzle piece into the model when doing image segmentation (inference mode)
  4. Horizontally flipping the extracted ROI of the backside so that its mirrored shape matches the outline of the front side
  5. Applying SIFT feature matching between the flipped backside and the front-side piece, isolating the front side
  6. Extracting the match from the SIFT detection and creating an instance segmentation mask on the front-side image
  7. Taking this final mask and applying SIFT feature detection between it and the box image

Given that puzzle pieces have a significant number of internal edges and features that may distract or mislead an image segmentation model, training on the backside and then horizontally flipping the extracted mask should result in a higher Intersection over Union for both the mask and box predictions. For example, when an object was only partially contained within the piece, the model sometimes left that part out of its prediction, lowering the predicted mask IoU. That said, I do not believe the model's segmentation accuracy held back the final performance of this endeavor as much as the feature matching did.
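
Step 4 of the proposed process is straightforward with OpenCV; a minimal sketch (file names are placeholders):

```python
import cv2

# Mirror the extracted backside ROI around the vertical axis (flipCode=1) so its
# silhouette lines up with the front of the piece.
back_roi = cv2.imread("piece_backside_extracted.png", cv2.IMREAD_UNCHANGED)
flipped = cv2.flip(back_roi, 1)
cv2.imwrite("piece_backside_flipped.png", flipped)
```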

Future Work

  • If I can find a way to speed up the feature detection, host the final result online, with the test images and boxes so that it is interactive. The next step would be to allow piece/box image uploads so that new puzzles can be tested.

  • Fine tuning SIFT's scale invariance or adding a fallback feature matching algorithm to improve the rate of successful matches.

Minor Headaches

  • First time working in Google Colab, so there was a learning curve in just figuring out file navigation, shortcuts, and how to properly install and reference parts of the project. Plus, a couple of crashes during model training (my fault).

  • Not enough exposure to, or intimate knowledge of, skimage's measure.find_contours and cv2.perspectiveTransform to handle cases where the transformed piece outline broke when drawn on the final image. Time was spent here, but unsuccessfully.

  • Knowledge creep between classes and functions over the life of the project, and wanting to access things like box_image or box_name in places where they should not necessarily live. This occurred mostly because, after getting the overall pipeline to work, I wanted to iterate across all the test and validation images by box and save the results in their appropriate places.

  • Desire to do much more than time allowed, and having to accept certain parts of the process as "completed" even though I would have enjoyed improving/cleaning/changing them.

Resources

Instance Segmentation & Mask RCNN

Feature Matching

Plotting & Utility
