clip computer-vision deep-learning nlp search-objects search-on-image yolo

Search Between the Objects - SBO

Search between the objects in an image, and cut the region of the detected object.

About this project

CLIP model was proposed by the OpenAI company, to understand the semantic similarity between images and texts. It's used for preform zero-shot learning tasks, to find objects in an image based on an input query.

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.

Also, YOLOv5 was used in the first step of the method, to detect the location of the objects in an image.

Demo

Demo is ready!

(Sometimes, the Streamlit website may crash! because models are heavy for it.)

Notebook

Run this notebook on Google Colab and test on your images! (It works both on CPU and GPU)

Open in nbviewer

Limitations

Obviously object detector model only can find object classes learned from the COCO dataset. So if your results are not related to your query, maybe the object you want is not in the COCO classes.