
snip-dedup


This repo is a WIP, but the main functionalities will be:

  • Download de-duplicated versions of LAION-2B-en (better versions coming soon...)
  • Download small indices (25-40GB) for retrieval / dataset creation / de-duplication
  • Compress features using pretrained SNIP networks (for ViT-H-14, ViT-L-14, ViT-B-32)
  • Read our research paper
  • Train SNIP on your CLIP features
  • Run a de-duplication of your dataset using our de-dup code

SNIP is a technique to compress CLIP features. It is competitive with previous works for large-scale retrieval of deep features and has some nice properties for multi-modal features. Read more about it here.

We used SNIP to perform several de-duplications of LAION-2B-en. Our latest de-duplication found roughly 700M duplicates (we define the total number of duplicates as the total number of samples minus the number of duplicate groups). SNIP performs well at high compression ratios and can run at a very high query rate with low memory.

Install

pip install --upgrade snip-dedup

Usage

# List available commands
snip --help
snip download --help

# Download and deduplicate the first 10 shards of the dataset
snip download --start 0 --end 10

Then, you may download the (deduplicated) LAION-2B images with the awesome img2dataset.
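For instance, here is a hedged sketch of that step, assuming the deduplicated metadata is a folder of parquet shards with the usual LAION-2B-en column names (URL, TEXT); the paths are illustrative, not paths this repo produces:

from img2dataset import download

download(
    url_list="laion2b_dedup/",      # folder of deduplicated parquet shards (assumed path)
    input_format="parquet",
    url_col="URL",                  # standard LAION-2B-en column names
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="laion2b_images", # assumed output location
    processes_count=16,
    thread_count=64,
    image_size=256,
)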

See the colab for a demo of semantic search.

What is a Duplicate?

In our first iteration, we merely marked duplicates pairwise and removed one sample from each duplicate pair (the above code downloads a binary array marking which samples to remove). In our latest run, we recorded the entire adjacency matrix of duplication. For instance, suppose SNIP has labeled feature $k$ as a duplicate of feature $j$. Then $A[k,j] = A[j,k] = 1$ in the adjacency matrix. We're currently having trouble computing the full connected components of this matrix, see this issue.

If you allow connected components with only one node, then computing the number of "unique" samples amounts to keeping one sample from each connected component: with $|\mathcal{C}|$ components over $N$ nodes, that leaves $D := N - |\mathcal{C}|$ duplicates.
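On a toy example the definition is easy to verify with scipy. This is only an illustration of the counting, not the code used at LAION scale, where computing the full connected components is exactly the hard part:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# toy adjacency matrix over N = 6 samples: (0,1) and (3,4) are duplicate pairs
N = 6
rows = np.array([0, 1, 3, 4])
cols = np.array([1, 0, 4, 3])
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, N))

n_components, labels = connected_components(A, directed=False)  # |C| = 4 here
D = N - n_components  # 6 nodes, 4 components -> 2 duplicates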

Approximate CCs of Duplicates

Currently, we have an approximation of the CCs of the duplicates. During the de-duplication, we label nodes as follows. Suppose we are at node $n$; the pseudo-code for one labeling step is:

labels = np.arange(0, N)  # initially, every node is its own label
...
d, i = index.search(feats[n:n+1, :], k)  # k nearest neighbors of feature n (faiss expects a 2D query)
dups = get_dups(d, i)  # adaptive threshold on ADC distances (see paper)
labels[dups] = resolve_labels_one_step(dups)

where N is the number of nodes (2B for LAION-2B). Here resolve_labels_one_step simply re-points any node that is still unlabeled to the current node $n$, so the labeling grows a forest of trees.
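As a point of reference, a minimal sketch of such a step might look as follows; passing labels and n explicitly is an assumption made to keep the snippet self-contained, and the actual helper in the codebase may differ:

import numpy as np

def resolve_labels_one_step(dups, labels, n):
    # hypothetical sketch: a node still pointing at itself is "unlabeled";
    # re-point every such duplicate at the current node n
    new = labels[dups].copy()
    new[new == dups] = n
    return new

We then connect nodes with common ancestors with a fixed-point iteration: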

# pointer-jumping: repeat until every label points directly at its root
while not np.array_equal(labels, labels[labels]):
    labels = labels[labels]

The labels resulting from the above loop can be found on huggingface: vitl14_labels.
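From those labels, the duplicate count follows directly. A minimal sketch, assuming the file loads as a flat numpy array (the filename and format here are assumptions):

import numpy as np

labels = np.load("vitl14_labels.npy")       # one root label per sample (assumed format)
n_components = np.unique(labels).size       # |C|: one root per duplicate set
n_duplicates = labels.size - n_components   # D = N - |C| as defined above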

Misc files (old)

We release this index for public use and exploration of the LAION-2B-en dataset.

You may find the following necessary files here:

  • Binary array of de-duplicated images
  • SNIP index
  • SNIP descriptor
  • Other: cumulative sizes of features, for indexing sharded files (see the sketch below)
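As a sketch of how the cumulative sizes can be used to locate a sample inside the sharded feature files (the filename is an assumption):

import numpy as np

cum_sizes = np.load("cumulative_sizes.npy")  # cum_sizes[s] = total samples through shard s (assumed)

def locate(global_idx):
    # map a global sample index to (shard, index within that shard)
    shard = int(np.searchsorted(cum_sizes, global_idx, side="right"))
    local = global_idx - (int(cum_sizes[shard - 1]) if shard > 0 else 0)
    return shard, local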

Finding images overfit by Stable Diffusion

By analyzing the most duplicated images, we have found several more images copied verbatim by Stable Diffusion, posing a copyright problem:

(Example images: "sylvester overfit" and "hopped up logo".)

Note on False positives

We noticed that many images labeled as duplicates by SNIP but not by raw features are in fact near-duplicates, for example:

(Example near-duplicate pair: Chess1, Chess2.)

You may check a list of (randomly sampled) detected duplicate pairs here.

Semantic Search

SNIP can also be used for semantic search. At just 25GB, it still returns the same k-NNs as exhaustive search roughly a third of the time, over 2.15B database vectors.
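A minimal search sketch, assuming the released SNIP index loads with faiss and is queried with CLIP features (the filename and feature dimension are assumptions; ViT-H-14 image features are 1024-d):

import faiss
import numpy as np

index = faiss.read_index("snip_vith14.index")        # assumed filename
query = np.random.randn(1, 1024).astype("float32")   # stand-in for a real CLIP feature
query /= np.linalg.norm(query)                       # CLIP retrieval typically uses normalized features
distances, ids = index.search(query, 10)             # top-10 approximate neighbors over LAION-2B-en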

Contribute

Contributions are welcome. Usually, the best way is first to open an issue to discuss things.

This Python project uses the hatch project manager. Dependencies are specified in the pyproject.toml file, and build configs in the hatch.toml file. As such, you can enter the isolated development environment with hatch shell from inside the repository.

The code should be documented following the NumPy docstring standard.

To avoid silly mistakes, the code is checked with pyright. To ensure consistent styling, all Python code is formatted with black and linted with ruff. Note that these can usually be installed in your editor, such as VS Code, to view the checks directly in the code. Once you have installed them (suggested via pipx), you can check that the code is consistent with:

hatch run check           # check for mistakes via static analysis with pyright
black --check snip_dedup/ # check formatting of all python files
ruff check snip_dedup/    # check linting rules

STILL TODO:

Citation

@misc{webster2023deduplication,
      title={On the De-duplication of LAION-2B}, 
      author={Ryan Webster and Julien Rabin and Loic Simon and Frederic Jurie},
      year={2023},
      eprint={2303.12733},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

About

License: MIT
