mrdbourke / nutrify

Take a photo of food and learn about it.

Home Page: https://nutrify.app

Clean/remove duplicate images with `fastdup`

mrdbourke opened this issue

Make a script to clean and remove duplicate images with fastdup - https://github.com/visual-layer/fastdup

  • This should scale well: the fastdup team tested it across ImageNet21k (millions of images) and it finished in ~3 hours
  • Could run this script periodically to clean images whenever new images are downloaded
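A rough sketch of what such a periodic cleaning script could look like. This is not fastdup's API: it is a standard-library stand-in that only catches byte-for-byte identical files by hashing their contents, whereas fastdup also finds visually similar images. The function names, the extension list and the dry-run default are all assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumption: which file extensions count as images in the dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_exact_duplicates(image_dir):
    """Group image files under image_dir that have identical bytes.

    Returns a list of groups; each group is a list of 2+ paths.
    """
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def remove_duplicates(duplicate_groups, dry_run=True):
    """Keep the first file of each group and remove the rest.

    With dry_run=True nothing is deleted; the would-be removals are
    only returned so they can be inspected first.
    """
    removed = []
    for paths in duplicate_groups:
        for extra in paths[1:]:
            if not dry_run:
                extra.unlink()
            removed.append(extra)
    return removed
```

Running `remove_duplicates(find_exact_duplicates("path/to/images"))` with the default `dry_run=True` lists what would be deleted, which makes it safer to run whenever new images are downloaded.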

Did this with a notebook and removed 695/25000 (or thereabouts) images. Saw a slight reduction in performance, but this was expected since there is now less data leakage between the train & test sets. See the evaluation run: https://wandb.ai/mrdbourke/test_wandb_artifacts_by_reference/runs/714m0crl
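To make the leakage point concrete, here is a minimal sketch of a check for files that appear in both splits. It assumes the train and test images live in two separate folders and it only detects byte-identical copies, so it would underestimate the near-duplicate leakage that fastdup targets. `digest_index` and `find_train_test_leaks` are hypothetical helper names, not part of the notebook.

```python
import hashlib
from pathlib import Path


def digest_index(folder):
    """Map SHA-256 digest -> list of file paths found under folder."""
    index = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            index.setdefault(digest, []).append(path)
    return index


def find_train_test_leaks(train_dir, test_dir):
    """Return (train_paths, test_paths) pairs whose contents are identical."""
    train_index = digest_index(train_dir)
    test_index = digest_index(test_dir)
    # Digests present in both splits indicate leaked images.
    shared = sorted(train_index.keys() & test_index.keys())
    return [(train_index[d], test_index[d]) for d in shared]
```

An empty result only means there are no exact copies across the splits; resized or re-encoded copies would still slip through.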

Original notes (from #50) -

  • Found a library to help with image deduplication via hashing: https://github.com/idealo/imagededup
    Removing duplicates will help make the model more robust and prevent data from leaking from train → test set (and then giving false metrics)
  • Created a small notebook for this (07_remove_duplicates.ipynb) and it seems to work very well: it found ~500/24500 images were duplicates in a few minutes, and very few of the flagged samples weren't actual duplicates (judging from a series of quick random plots)
  • Could integrate this workflow to run over all the images every so often (or whenever new data is added to the dataset).
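For intuition on how hashing can flag duplicate images, below is a toy average-hash sketch. It illustrates the general idea only and is not imagededup's implementation. It assumes each image has already been decoded and downscaled to a small grayscale grid (a list of rows of integers), and the function names are made up for this example.

```python
def average_hash(pixels):
    """Hash a small grayscale grid into an integer bit pattern.

    Each pixel contributes one bit: 1 if it is brighter than the
    grid's mean, else 0.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# An 8x8 gradient and a uniformly brightened copy of it.
grid = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
brighter = [[p + 3 for p in row] for row in grid]

# The brightened copy hashes identically, so it would be flagged
# as a duplicate of the original.
print(hamming_distance(average_hash(grid), average_hash(brighter)))  # 0
```

A small Hamming distance between two hashes suggests the images are near-duplicates; the threshold for "small" would need tuning on real data.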

Next will be to turn the notebook version of this into a script