snip-dedup
This repo is a WIP, but the main functionalities will be:
- Download de-duplicated versions of LAION-2B-en (Better versions coming soon...)
- Download small indices (25-40GB) for retrieval / dataset creation / de-duplication
- Compress features using pretrained SNIP networks (for ViT-H-14, ViT-L-14, ViT-B-32)
- Read our research paper
- Train SNIP on your CLIP features
- Run a de-duplication of your dataset using our de-dup code
SNIP is a technique to compress CLIP features. It is competitive with previous works for large scale retrieval of deep features, and has some nice properties for multi-modal features. Read more about it here.
We used SNIP to perform several de-duplications of LAION-2B-en. Our latest de-duplication found roughly 700M duplicates (we define total duplicates as the total number of samples minus the number of duplicate groups). SNIP performs well at high compression ratios and can serve a very high number of queries per second with low memory.
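The duplicate count defined above can be illustrated with a small sketch. It assumes each sample carries a group label and that a sample with no duplicates forms its own group; `count_duplicates` and the toy labels are hypothetical and not part of `snip-dedup`.

```python
def count_duplicates(group_labels):
    """Return (num_groups, num_duplicates) for per-sample group labels."""
    num_samples = len(group_labels)
    # Samples without any duplicate count as a group of size one.
    num_groups = len(set(group_labels))
    # Total duplicates = total samples - duplicate groups.
    return num_groups, num_samples - num_groups


# Toy data: samples 0-2 share a group, 3-4 share another, 5 is alone.
toy_labels = [0, 0, 0, 3, 3, 5]
num_groups, num_duplicates = count_duplicates(toy_labels)
print(num_groups, num_duplicates)  # 3 3
```

Keeping one sample per group therefore leaves `num_groups` samples in the de-duplicated set.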
Install
pip install --upgrade snip-dedup
Usage
# List available commands
snip --help
snip download --help
# Download and deduplicate the first 10 shards of the dataset
snip download --start 0 --end 10
Then, you may download (deduplicated) laion2b images with the awesome img2dataset.
See the colab for a demo on search.
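Once a removal mask is available, applying it to your own metadata could look like the sketch below. It assumes the mask is aligned with the dataset order and that a truthy entry means "remove this sample"; the actual file layout is not described here, and `keep_unique` and the example URLs are hypothetical.

```python
def keep_unique(records, remove_mask):
    """Drop every record whose mask entry is truthy (flagged for removal)."""
    if len(records) != len(remove_mask):
        raise ValueError("records and remove_mask must have the same length")
    return [rec for rec, remove in zip(records, remove_mask) if not remove]


# Hypothetical metadata rows and a hypothetical removal mask (1 = remove).
urls = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
remove_mask = [0, 1, 0, 1]
print(keep_unique(urls, remove_mask))  # ['a.jpg', 'c.jpg']
```

The filtered list can then be written out in whatever input format your download tool expects.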
What is a Duplicate?
In our first iteration, we merely marked duplicates pairwise and removed one sample from each duplicate pair (the above code downloads a binary array marking the samples to remove). In our latest run, we recorded the entire adjacency matrix of duplication. For instance, suppose SNIP has labeled feature
If you allow connected components with only one node, then to compute the number of "unique" samples, you simply take one from each duplicate set, say
Approximate CCs of Duplicates
Currently, we have an approximation of the connected components (CCs) of the duplicates. During the de-duplication, we label nodes as follows. Suppose we are at node
import numpy as np

labels = np.arange(0, N)  # every node starts out labeled with its own index
...
d, i = index.search(feats[n, :], k)
dups = get_dups(d, i)  # use adaptive threshold on ADC (see paper)
labels[dups] = resolve_labels_one_step(dups)
Where N is the number of nodes (2B for L2B). Here resolve_labels_one_step will simply re-write any node that is unlabeled to be the current node.
# repeat until the labels stop changing
while not np.array_equal(labels, labels[labels]):
    labels = labels[labels]
The labels produced by the above loop can be found on Hugging Face: vitl14_labels.
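The loop above is a form of pointer jumping: each node repeatedly adopts the label of its label until nothing changes. The sketch below runs it on a toy list in plain Python and is not the code used for the release. It assumes every chain of labels ends at a node labeled with itself, since a longer cycle would never converge; `resolve_labels` is a hypothetical helper. With NumPy arrays the update step is simply `labels = labels[labels]`.

```python
def resolve_labels(labels):
    """Repeat labels[i] <- labels[labels[i]] until the labels stop changing."""
    labels = list(labels)
    while True:
        # One pointer-jumping step: every node takes its label's label.
        jumped = [labels[parent] for parent in labels]
        if jumped == labels:
            return labels
        labels = jumped


# Toy chain 3 -> 2 -> 1 -> 0 plus a separate pair 5 -> 4.
toy = [0, 0, 1, 2, 4, 4]
print(resolve_labels(toy))  # [0, 0, 0, 0, 4, 4]
```

Each step halves the distance from a node to its root label, so the number of passes grows only logarithmically with the longest chain.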
Misc files (old)
We release this index for public use and exploration of the LAION-2B-en dataset.
You may find the following necessary files here:
Binary array of De-duplicated Images
Other:
cumulative sizes of features (for indexing sharded files)
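As a rough illustration of how cumulative sizes help with sharded files, the sketch below maps a global feature index to a shard number and an offset inside that shard. It assumes the array holds running totals of per-shard feature counts with no leading zero; the released file may be laid out differently, and `locate` and the toy sizes are hypothetical.

```python
from bisect import bisect_right


def locate(global_index, cumulative_sizes):
    """Map a global feature index to (shard_number, offset_within_shard)."""
    if not cumulative_sizes or not 0 <= global_index < cumulative_sizes[-1]:
        raise IndexError("global_index out of range")
    # First shard whose running total exceeds the global index.
    shard = bisect_right(cumulative_sizes, global_index)
    shard_start = cumulative_sizes[shard - 1] if shard > 0 else 0
    return shard, global_index - shard_start


# Hypothetical shards holding 100, 150 and 150 features.
cumulative_sizes = [100, 250, 400]
print(locate(0, cumulative_sizes))    # (0, 0)
print(locate(100, cumulative_sizes))  # (1, 0)
print(locate(399, cumulative_sizes))  # (2, 149)
```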
Finding images overfit by Stable Diffusion
By analyzing the most duplicated images, we have found several more images verbatim copied by Stable Diffusion, posing a copyright problem:
Note on False positives
We noticed that many images labeled as duplicates by SNIP but not by the raw features are in fact near duplicates, for example:
You may check a list of (randomly sampled) detected duplicate pairs here.
Semantic Search
SNIP can also be used for semantic search. At just 25GB, it can still return the same k-NNs as exhaustive search roughly a third of the time, over 2.15B database vectors.
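One way to measure this kind of agreement on your own queries is sketched below. It assumes "the same k-NNs" means that the approximate and exhaustive result sets match exactly for a query, ignoring order; that reading, along with `knn_agreement` and the toy ids, is an assumption rather than the metric used for the figure above.

```python
def knn_agreement(approx_ids, exact_ids):
    """Fraction of queries whose approximate k-NN set equals the exact one."""
    if len(approx_ids) != len(exact_ids):
        raise ValueError("both inputs must cover the same queries")
    if not approx_ids:
        return 0.0
    # Compare as sets so that the ranking order within the top k is ignored.
    matches = sum(set(a) == set(e) for a, e in zip(approx_ids, exact_ids))
    return matches / len(approx_ids)


# Toy results for three queries with k = 2; only the first query agrees.
approx = [[1, 2], [3, 5], [7, 8]]
exact = [[2, 1], [3, 4], [7, 9]]
print(knn_agreement(approx, exact))
```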
Contribute
Contributions are welcome. Usually, the best way is first to open an issue to discuss things.
This Python project uses the hatch project manager. Dependencies are specified inside the pyproject.toml file, and build configs inside the hatch.toml file. As such, you can enter the isolated development environment with hatch shell from inside the repository.
The code should be documented following the Numpy docstring standard.
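For contributors unfamiliar with that standard, a docstring in this style might look like the sketch below. The function `duplicate_ratio` is hypothetical and only serves to show the section layout; it is not part of `snip_dedup`.

```python
def duplicate_ratio(num_samples, num_groups):
    """Compute the fraction of samples that are duplicates.

    Parameters
    ----------
    num_samples : int
        Total number of samples in the dataset.
    num_groups : int
        Number of duplicate groups, counting singletons.

    Returns
    -------
    float
        Share of samples that would be removed by keeping one per group.

    Raises
    ------
    ValueError
        If `num_samples` is not positive or `num_groups` is out of range.
    """
    if num_samples <= 0 or not 0 <= num_groups <= num_samples:
        raise ValueError("expected 0 <= num_groups <= num_samples, num_samples > 0")
    return (num_samples - num_groups) / num_samples


print(duplicate_ratio(6, 3))  # 0.5
```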
To avoid silly mistakes, the code is checked with pyright. To ensure consistent styling, all Python code is formatted with black and we use the ruff linter. Note that these tools can usually be installed in your editor, such as VS Code, to view the checks directly in the code. Once you have installed them (suggested via pipx), you can check that the code is consistent with:
hatch run check # check for mistakes via static analysis with pyright
black --check snip_dedup/ # check formatting of all python files
ruff check snip_dedup/ # check linting rules
STILL TODO:
- add docs / tutorial
- add tests
- check max file size on CI to prevent pushing data
- auto publish github action. example at https://github.com/ofek/hatch-showcase/blob/master/.github/workflows/build.yml
Citation
@misc{webster2023deduplication,
title={On the De-duplication of LAION-2B},
author={Ryan Webster and Julien Rabin and Loic Simon and Frederic Jurie},
year={2023},
eprint={2303.12733},
archivePrefix={arXiv},
primaryClass={cs.CV}
}