arthurfeeney / Blimps-Lib

Header-only C++/Python library for approximate maximum inner product search and near neighbor search

Blimps Lib

This is an implementation of Norm-Ranging Locality Sensitive Hashing, based on the original paper, of which I am not an author. The authors claim that it improves on SimpleLSH.

This implementation is header-only at the moment. Matrix operations use Eigen. It uses boost/multiprecision to allow for very long hash codes. Tests use Catch2, which is header-only as well.

In the bind directory, there are some minimal Python bindings made with pybind11. The movielenstest uses these bindings to build a pureSVD recommender system for the MovieLens 10M dataset, in the spirit of the ALSH paper, of which I am also not an author.

The library also provides a simple C++ implementation of, and Python bindings for, vanilla-LSH.

The tables all expose these functions for inner products and distance, where adj is the number of buckets to probe.

  • probe(query, adj)
  • k_probe(k, query, adj)
  • probe_approx(query, c, adj)
  • k_probe_approx(k, query, c, adj)
  • contains(query)

Usage

All needed libraries are included in external/, so it should be decently portable. Unit tests and synthetic data tests can be run using the Makefile. Running the movielenstest will not work, since the data is not included in this repository. The examples in pyexamples and synthetic show how to use the library. The Python bindings can be compiled using "make binding".

Examples

Examples can be found in synthetic/ and pyexamples/, which contain code using the C++ library and the Python bindings, respectively.

LSH for Maximum Inner Product Search

Bachrach et al. show that maximum inner product search (MIPS) is reducible to near neighbor search, at the cost of increasing the dimension. So, if the data is transformed, it is possible to use LSH for MIPS! The transformation used by SimpleLSH, which maps the d-dimensional unit ball to the unit sphere in d+1 dimensions, is defined as

P(x) = \big(x, \sqrt{1 - \|x\|_2^2}\big)
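
A short derivation shows why this works. It assumes, as in the SimpleLSH setting, that the query q is scaled to unit norm, so that P(q) = (q, 0); that normalization is not stated above and is an assumption here.

```latex
\begin{aligned}
\|P(x)\|_2^2 &= \|x\|_2^2 + \big(1 - \|x\|_2^2\big) = 1, \\
P(q)^\top P(x) &= q^\top x + 0 \cdot \sqrt{1 - \|x\|_2^2} = q^\top x, \\
\|P(q) - P(x)\|_2^2 &= \|P(q)\|_2^2 + \|P(x)\|_2^2 - 2\, P(q)^\top P(x) = 2 - 2\, q^\top x .
\end{aligned}
```

The transformed points all lie on the unit sphere, the inner product with the query is unchanged, and the squared distance to the query is a decreasing function of that inner product. Minimizing distance after the transformation is therefore the same as maximizing the original inner product.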

After this transformation, vectors with large inner products are also likely to have similar hashes: on the unit sphere, the nearest vector is also the one that maximizes the inner product. The following plots highlight this. Using SimpleLSH with 32-bit hashes, we plot inner product against hash similarity for many random 50-dimensional vectors. For a random query q, the x-axis is each vector's inner product with q, and the y-axis is the similarity between that vector's hash and q's hash, where similarity is the number of matching bits. The black line is the best-fit line computed with np.polyfit. As the inner product increases, the similarity tends to increase as well.

The source code to make these plots is in synthetic/simple_lsh_plot.py.
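
Independently of that script, the experiment can be sketched in plain Python. This is not the code in synthetic/simple_lsh_plot.py and does not use the library's bindings. All function names are hypothetical, the hash is assumed to be a sign random projection (one Gaussian hyperplane per bit), and the query is assumed to have unit norm.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def simple_lsh_transform(x):
    """Return P(x) = (x, sqrt(1 - ||x||_2^2)) for x in the unit ball."""
    sq_norm = dot(x, x)
    if sq_norm > 1.0 + 1e-9:
        raise ValueError("x must lie in the unit ball")
    # Clamp at zero to guard against floating-point round-off.
    return list(x) + [math.sqrt(max(0.0, 1.0 - sq_norm))]


def make_hyperplanes(bits, dim, rng):
    """One random Gaussian hyperplane per hash bit (assumed hash family)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def sign_hash(v, hyperplanes):
    """Bit i is 1 if v lies on the non-negative side of hyperplane i."""
    return [1 if dot(h, v) >= 0.0 else 0 for h in hyperplanes]


def hash_similarity(h1, h2):
    """Number of matching bits between two hash codes."""
    return sum(1 for a, b in zip(h1, h2) if a == b)


def random_direction(dim, rng):
    """Random unit vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def experiment(n=1000, dim=50, bits=32, seed=0):
    """Return (inner product with q, hash similarity with q) per vector."""
    rng = random.Random(seed)
    planes = make_hyperplanes(bits, dim + 1, rng)
    q = random_direction(dim, rng)  # unit-norm query
    q_hash = sign_hash(simple_lsh_transform(q), planes)
    points = []
    for _ in range(n):
        radius = rng.random()  # keeps x inside the unit ball
        x = [radius * c for c in random_direction(dim, rng)]
        x_hash = sign_hash(simple_lsh_transform(x), planes)
        points.append((dot(q, x), hash_similarity(q_hash, x_hash)))
    return points
```

Calling experiment() returns the pairs that would go on the x- and y-axes of a scatter plot like the one described above. Sampling vectors this way is only one possible choice and may not match how the original plots were generated.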

Figure: inner product vs. hash similarity

License: GNU General Public License v3.0


Languages

C++ 75.5%, Python 23.7%, Makefile 0.7%