ANN Filtered Retrieval Datasets

This repo contains a collection of datasets, inspired by ann-benchmarks, for searching for similar vectors with additional filtering conditions.

Motivation

More and more applications are now using vector similarity search in their products. The task of approximate nearest neighbor (ANN) search has gone beyond the scope of academic research and the narrow circle of huge IT corporations.

As a result, combining vector search with application business logic is becoming increasingly relevant.

Examples and cases

It is no longer enough to simply search for similar dishes by photo; you need to search for them only in restaurants that are in the delivery area.

It is not enough to search for all items with a similar description; you also need to consider price ranges, stock availability, etc.

It is not enough to find candidates for a job position based on similar skills; you also have to consider location, level of spoken language, and seniority.

You name it.

Is it that different?

Classical approaches to ANN, and their implementations in many libraries, were usually tuned for benchmarks in which search speed over all vectors is the only comparison criterion.

Because of this, they had to sacrifice many features that are useful in other situations: the ability to quickly delete, insert, and modify stored values, as well as storing metadata and filtering on it.

Data

Description	Num vectors	Dim	Distance	Filters	Link
all-MiniLM-L6-v2 ArXiv titles	2 138 591	384	Cosine	match keyword / range	link
Efficientnet encoded H&M Clothes	105 100	2048	Cosine	match keyword	link
LAION Sample encoded with CLIP	100 000	512	Cosine	range	link
Random vectors \ random payload	1 000 000	100	Cosine	match keyword	link
Random vectors \ random payload	1 000 000	100	Cosine	match int	link
Random vectors \ random payload	1 000 000	100	Cosine	range	link
Random vectors \ random payload	1 000 000	100	Cosine	geo-radius	link
Random vectors \ random payload	100 000	2048	Cosine	match keyword	link
Random vectors \ random payload	100 000	2048	Cosine	match int	link
Random vectors \ random payload	100 000	2048	Cosine	range	link
Random vectors \ random payload	100 000	2048	Cosine	geo-radius	link

Data Format

Each dataset consists of the following files:

vectors.npy - NumPy matrix of vectors with shape num_vectors x dim
payloads.jsonl - payload values associated with the vectors. The number of lines equals num_vectors
tests.jsonl - collection of queries with filtering conditions and expected results. Each entry contains the fields:
- query - vector to be used for similarity search
- conditions - filtering conditions of 3 possible types: match, range, and geo
- closest_ids - IDs of the records expected to be found for the given query
- closest_scores - similarity scores of the corresponding IDs
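
The three files described above can be read with NumPy and the standard json module. The following is a minimal sketch, not an official loader: the function names read_jsonl and load_dataset are hypothetical, and it assumes all three files sit in one directory under exactly the names listed.

```python
import json
from pathlib import Path

import numpy as np


def read_jsonl(path):
    """Read a JSON Lines file into a list of parsed objects, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_dataset(directory):
    """Load vectors, payloads, and tests from one dataset directory.

    Assumption: the directory holds vectors.npy, payloads.jsonl, and tests.jsonl.
    """
    directory = Path(directory)
    vectors = np.load(directory / "vectors.npy")
    payloads = read_jsonl(directory / "payloads.jsonl")
    tests = read_jsonl(directory / "tests.jsonl")
    # The format promises one payload line per vector, so check it early.
    if len(payloads) != vectors.shape[0]:
        raise ValueError("payloads.jsonl and vectors.npy disagree on num_vectors")
    return vectors, payloads, tests
```

For the larger datasets it may be worth passing mmap_mode="r" to np.load so the matrix is not read into memory all at once.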

Example queries

{
  "query": [-0.034, -0.185, -0.21, ...],
  "conditions": {
    "and": [
      {
        "department_name": {
          "match": {
            "value": "Divided Shoes"
          }
        }
      }
    ]
  },
  "closest_ids": [565, 15631, 100747, ....],
  "closest_scores": [0.734, 0.698, 0.697, 0.689, ...]
}
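
To show how an entry like the one above might be checked, here is a hedged sketch of an exact (brute-force) filtered search. It rests on several assumptions that the description does not confirm: each payload is a flat JSON object mapping field names to values, the values in closest_ids are row indices into vectors.npy, and scores are cosine similarities. Only the match condition shown in the example is handled; range and geo are left out because their exact schema is not shown here. The names matches, filtered_search, and recall are hypothetical.

```python
import numpy as np


def matches(payload, conditions):
    """Return True if the payload satisfies every clause listed under "and".

    Only "match" clauses are supported; anything else raises, because the
    schema of range and geo clauses is not shown in the example.
    """
    for clause in conditions.get("and", []):
        for field, rule in clause.items():
            if "match" not in rule:
                raise NotImplementedError(f"unsupported condition: {rule}")
            if payload.get(field) != rule["match"]["value"]:
                return False
    return True


def filtered_search(vectors, payloads, query, conditions, top=10):
    """Exact cosine search restricted to rows whose payload passes the filter.

    Assumption: row i of `vectors` belongs to payloads[i], and no vector is
    all zeros (a zero vector would make the cosine undefined).
    """
    allowed = np.array(
        [i for i, p in enumerate(payloads) if matches(p, conditions)], dtype=int
    )
    if allowed.size == 0:
        return [], []
    candidates = vectors[allowed]
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    scores = candidates @ query / norms
    order = np.argsort(-scores)[:top]  # highest similarity first
    return allowed[order].tolist(), scores[order].tolist()


def recall(found_ids, expected_ids):
    """Fraction of expected IDs that appear among the found IDs."""
    if not expected_ids:
        return 1.0
    return len(set(found_ids) & set(expected_ids)) / len(expected_ids)
```

Comparing the IDs returned by an ANN engine against closest_ids with recall gives a simple accuracy figure per query. Filtering before scoring, as done here, keeps the reference result exact regardless of how selective the condition is.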

Sources

Random data generator - script
Image data - Kaggle
Image embeddings generator - Colab
