Simple dedup based on perceptual hashes from https://phash.org, using FAISS to construct the indexes.
First, install a fork of the pHash library (https://github.com/mehdidc/pHash), which adds a simple extension to compute hashes from in-memory buffers instead of files, for better efficiency.
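For a concrete picture of what buffer-based hashing looks like, here is a minimal sketch. It uses the imagehash package purely as a stand-in (the fork above exposes its own, different API), and phash_from_buffer is a name made up for this example:

    # Illustrative stand-in: imagehash instead of the pHash fork's buffer API.
    import io

    import imagehash
    from PIL import Image

    def phash_from_buffer(buf: bytes) -> int:
        """Compute a 64-bit perceptual hash from an in-memory image buffer."""
        image = Image.open(io.BytesIO(buf)).convert("RGB")
        # imagehash.phash returns an 8x8 boolean array; pack it into one 64-bit int.
        bits = imagehash.phash(image, hash_size=8).hash.flatten()
        value = 0
        for bit in bits:
            value = (value << 1) | int(bit)
        return value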
Then install FAISS (instructions: https://github.com/facebookresearch/faiss) and the remaining Python dependencies:
pip install -r requirements.txt
-
On WebDataset (WDS) datasets:
python cli.py compute-hashes wds "<path>/{00000..41455}.tar" --batch-size=10000 --workers=64 --out-path=hashes_upstream.npz
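A rough sketch of what the compute-hashes step does for WebDataset shards (the real cli.py handles decoding, workers, and batching differently; the field names below are assumptions):

    import numpy as np
    import webdataset as wds

    def hash_wds_shards(pattern: str, out_path: str) -> None:
        # Stream raw image bytes and sample keys from the tar shards.
        dataset = wds.WebDataset(pattern).to_tuple("__key__", "jpg;jpeg;png")
        keys, hashes = [], []
        for key, image_bytes in dataset:
            keys.append(key)
            hashes.append(phash_from_buffer(image_bytes))  # from the sketch above
        np.savez(out_path, keys=np.array(keys), hashes=np.array(hashes, dtype=np.uint64))

    # hash_wds_shards("<path>/{00000..41455}.tar", "hashes_upstream.npz")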
-
On datasets supported by CLIP benchmark (https://github.com/LAION-AI/CLIP_benchmark):
python cli.py compute-hashes imagenet1k <path_root_imagenet1k> --batch-size=10000 --workers=64 --out-path=hashes_imagenet1k.npz
python cli.py build-index hashes_upstream.npz --out-index=index_upstream.pkl --out-meta=index_meta.parquet
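The build-index step can be pictured as packing each 64-bit hash into 8 bytes and adding them to an exact binary FAISS index; the array names and serialization below are assumptions, not the tool's actual format:

    import faiss
    import numpy as np

    def build_binary_index(hashes_npz: str, out_index: str) -> faiss.IndexBinaryFlat:
        data = np.load(hashes_npz, allow_pickle=True)
        # View each 64-bit hash as 8 bytes, the layout IndexBinaryFlat expects.
        codes = data["hashes"].astype(np.uint64).view(np.uint8).reshape(-1, 8)
        index = faiss.IndexBinaryFlat(64)  # 64-bit codes, exact Hamming search
        index.add(codes)
        faiss.write_index_binary(index, out_index)
        return index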
python cli.py dupfind index_upstream.pkl hashes_imagenet1k.npz --threshold=1 --out-path=dups_imagenet1k.csv
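The dupfind step amounts to a Hamming-range query against that index, keeping every upstream image within --threshold bits of a downstream hash. A sketch, with the column names and the exclusive-radius convention as assumptions:

    import csv

    import faiss
    import numpy as np

    def dupfind(index_path: str, hashes_npz: str, threshold: int, out_path: str) -> None:
        index = faiss.read_index_binary(index_path)
        data = np.load(hashes_npz, allow_pickle=True)
        queries = data["hashes"].astype(np.uint64).view(np.uint8).reshape(-1, 8)
        # Assuming an exclusive radius, threshold + 1 keeps matches at exactly
        # `threshold` differing bits.
        lims, dists, ids = index.range_search(queries, threshold + 1)
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["query_id", "index_id", "hamming_distance"])
            for q in range(len(queries)):
                for j in range(lims[q], lims[q + 1]):
                    writer.writerow([q, int(ids[j]), int(dists[j])])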
python cli.py build-html-visualizer index_meta.parquet dups_imagenet1k.csv imagenet1k <path_root_imagenet1k>
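Purely for illustration, a bare-bones version of the visualizer step could join the matches back to the index metadata and dump an HTML table (the real command embeds the images; the column names are assumptions):

    import pandas as pd

    def build_html_report(meta_parquet: str, dups_csv: str, out_html: str = "dups.html") -> None:
        meta = pd.read_parquet(meta_parquet)
        dups = pd.read_csv(dups_csv)
        # Attach upstream metadata to each duplicate match via its index id.
        report = dups.merge(meta, left_on="index_id", right_index=True, how="left")
        report.to_html(out_html, index=False)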