ray-search

Simple library using a builder pattern to solve for top-k search and building text indexes in parallel across multiple nodes. It uses apache ray under the hood and is primarily designed to run on databricks clusters. It is a work in progress and currently experimental. The apis will break. Please use at your own peril.

Quick install

pip install git+https://github.com/stikkireddy/ray-search.git

Additionally please install faiss-cpu or faiss-gpu for indexing

pip install faiss-cpu

Start long term ray session locally to access dashboard and test locally

ray start \
      --head \
      --port=6379 \
      --disable-usage-stats \
      --object-manager-port=8076 \
      --include-dashboard=true \
      --dashboard-host=0.0.0.0 \
      --dashboard-port=8266
ray start --head --include-dashboard=true --disable-usage-stats --dashboard-port 9090 --node-ip-address=0.0.0.0

import pandas as pd

from ray_search import *

data = pd.read_csv("data/luggage_review.csv")
ids = data['review_id'].values.astype('U').tolist()
texts = data['review_body_trunc'].values.astype('U').tolist()

worker_ct = 5

matrix_builder: SearchMatrixBuilder = SearchMatrixBuilder() \
        .with_content(ids,
                      texts) \
        .with_workers(worker_ct) \
        .with_worker_memory(Memory.in_mb(512)) \
        .as_512_chunk() \
        .with_vectorizer(Vectorizers.SentenceTransformerDense)

search_builder: SearchBuilder = SearchBuilder() \
      .with_workers(worker_ct) \
      .with_searcher(Searchers.FaissANNSearch) \
      .with_worker_memory(Memory.in_mb(2046)) \
      .with_chunk_size(16) \
      .with_top_k_per_entity(5)
  
results = SearchPipeline() \
      .with_search_matrix_builder(matrix_builder) \
      .with_search_builder(search_builder) \
      .to_df_unnested()

print(results.head(10))

Roadmap

There are three core components that this library offers. Ability to vectorize your code in parallel using the state of the art. It helps you then run distributed top-k search, given a set of constraints. It also provides sinks for the vectors to be saved in a given format/index to disk.

Vectorizers - Creating dense/sparse vectors

Splade sparse vector support creation
TF-IDF vectorizer creation (single actor)
Distributed Count Vectorizer TF-IDF transformer creation (multi actor)
- can be achieved by generating distributed CSR matrices and merging them
Hybrid vectors (Splade sparse + sentencetransformer dense vectors) afaik pinecone is the only one which supports this
Dense vectors with sentence transformers
Huggingface embedding creation
Custom Vectorizer (single actor)
Custom Chunked Vectorizer (multi actor)

Search - Creating dense sparse vectors

Brute force compute Cosine Top K
Faiss Top K with IndexFlatIP index
Custom faiss index strategy via func
- right now the index isn't built and distributed
Annoy Top k
MinHash Lsh, MinHash Lsh Forest, MinHash Lsh Ensemble for k ANN search

Sink - Write the vectorized output to some persistent storage

Note: Experimental

This library is a work in progress the api will break. Please do not use for production.

About

Apache License 2.0

Languages

Language:Python 100.0%