18520339 / finding-similar-images

Finding similar images from image URLs using ImageHash

Home Page:https://www.youtube.com/watch?v=G3kVp-01nn8

Finding similar images

Introduction

  • In my Data Science project, my team had to collect images from many search engines to build a dataset, and we chose Google Sheets for assigning labeling tasks to each member because it is convenient.

  • Crawling the Internet returns many similar images, which biases the dataset. Here is my solution for filtering out similar images during the Data Preparation step.

Implementation

  1. Get image URLs from search engines. I have a repo for that here

  2. Copy and paste these URLs into Google Sheets. Here, we can see how similar images are arranged next to each other

  3. Connect to Google Sheets using Python

  4. With only 1 hash value, some images are reported as the same even though they are different. Therefore, we decided to calculate 3 hash values for each pair of images:

    • Average hashing (ahash)
    • Perceptual hashing (phash)
    • Difference hashing (dhash)

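  5. If the distances for at least 2 of these 3 hash values indicate that the 2 images are similar (distance below the "different points" threshold set by -d/--diff), then arrange these images next to each other

    # Distance between the two images for each of the three hash types
    distances = [ahash0 - ahash1, phash0 - phash1, dhash0 - dhash1]
    # Count how many hash types put the pair under the threshold
    diff_results = sum(dist < args['diff'] for dist in distances)
    
    # Treat the pair as similar when at least 2 of the 3 hashes agree
    if diff_results >= 2:
        print(f'|--Similar with url {idx1 + 1}: {url1}')
  6. Decide which images to keep and begin labeling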
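
The 2-of-3 vote in step 5 and the "arrange next to each other" idea can be sketched with the standard library alone. This is only an illustration, not code from this repo: hashes are stored as plain integers and compared by Hamming distance (which is what subtracting two ImageHash objects returns), and the names hamming, is_similar and arrange, the 16-bit toy hashes, and the threshold of 4 are all assumptions.

```python
from typing import List, Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes stored as integers."""
    return bin(a ^ b).count("1")


def is_similar(h0: Sequence[int], h1: Sequence[int], diff: int, votes: int = 2) -> bool:
    """h0 and h1 each hold (ahash, phash, dhash) for one image.

    The pair counts as similar when at least `votes` of the three
    distances are below `diff`.
    """
    distances = [hamming(a, b) for a, b in zip(h0, h1)]
    return sum(dist < diff for dist in distances) >= votes


def arrange(urls: Sequence[str], hashes: Sequence[Sequence[int]], diff: int) -> List[str]:
    """Reorder URLs so each image is followed by the later images similar to it."""
    ordered: List[str] = []
    used = set()
    for i, url in enumerate(urls):
        if i in used:
            continue
        used.add(i)
        ordered.append(url)
        for j in range(i + 1, len(urls)):
            if j not in used and is_similar(hashes[i], hashes[j], diff):
                used.add(j)
                ordered.append(urls[j])
    return ordered


# Toy 16-bit hashes (hypothetical values, not real image hashes).
img_a = (0b1111000011110000, 0b1010101010101010, 0b1100110011001100)
img_b = (0b1111000011110001, 0b1010101010101001, 0b0011001100110011)  # close to img_a on 2 of 3
img_c = (0b0000111100001111, 0b0101010101010101, 0b1100110011001100)  # close to img_a on 1 of 3

urls = ["a.jpg", "c.jpg", "b.jpg"]
result = arrange(urls, [img_a, img_c, img_b], diff=4)
print(result)
```

Requiring two agreeing hashes is what keeps a single coincidental match (img_c shares only its dhash with img_a) from grouping unrelated images.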

Usage

  1. Install libraries: pip install -r requirements.txt

  2. Sort similar images in Google Sheets:

  • Example: python sort_similar.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json

    usage: sort_similar.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH [-d DIFF]

    optional arguments:
      -h, --help                                    show this help message and exit
      -s SPREADSHEET, --spreadsheet SPREADSHEET     spreadsheet name
      -w WORKSHEET, --worksheet WORKSHEET           worksheet name
      -r RANGE, --range RANGE                       updated range
      -a AUTH, --auth AUTH                          credentials file
      -d DIFF, --diff DIFF                          different points

  3. Download images from URLs in Google Sheets:

  • Example: python download_images.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json -o images/

    usage: download_images.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH -o OUT

    optional arguments:
      -h, --help                                    show this help message and exit
      -s SPREADSHEET, --spreadsheet SPREADSHEET     spreadsheet name
      -w WORKSHEET, --worksheet WORKSHEET           worksheet name
      -r RANGE, --range RANGE                       updated range
      -a AUTH, --auth AUTH                          credentials file
      -o OUT, --out OUT                             path to images directory
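
For readers who want to see how a command line like the one for sort_similar.py could be declared, here is a minimal argparse sketch reconstructed from the help text above. It is not the script's actual source: the integer type for --diff and the DEFAULT_DIFF value are assumptions (the help text states neither), and build_parser is a hypothetical helper name.

```python
import argparse

# Hypothetical default: the help text does not state the real one.
DEFAULT_DIFF = 10


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed in the sort_similar.py usage message."""
    parser = argparse.ArgumentParser(prog="sort_similar.py")
    parser.add_argument("-s", "--spreadsheet", required=True, help="spreadsheet name")
    parser.add_argument("-w", "--worksheet", required=True, help="worksheet name")
    parser.add_argument("-r", "--range", required=True, help="updated range")
    parser.add_argument("-a", "--auth", required=True, help="credentials file")
    # Assumed to be an int because it is compared against hash distances.
    parser.add_argument("-d", "--diff", type=int, default=DEFAULT_DIFF,
                        help="different points")
    return parser


# Same arguments as the sort_similar.py example, plus an explicit threshold.
# vars() turns the namespace into a dict, matching the args['diff'] access style.
args = vars(build_parser().parse_args(
    ["-s", "example", "-w", "Sheet1", "-r", "B2:C", "-a", "credentials.json", "-d", "5"]
))
print(args["spreadsheet"], args["worksheet"], args["range"], args["diff"])
```

A lower --diff value makes the comparison stricter, so fewer pairs are grouped as similar.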
