alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.

Home Page: https://alexklibisz.github.io/elastiknn

Try using 32-bit MurmurHash instead of concatenated integers for LSH hashes

alexklibisz opened this issue

Several of the LSH implementations currently concatenate k hash values into an amplified hash value that is indexed and used for retrieval. This amplified hash value occupies (k + 1) * 4 bytes: the k 4-byte integer hash values plus the 4-byte hash table index (1 to L).
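To make the size concrete, here is a minimal sketch of the concatenation described above. The object name `ConcatenatedHashSketch`, the `concat` helper, and the choice to write the table index first are assumptions for illustration, not the actual elastiknn code.

```scala
import java.nio.ByteBuffer

// Hypothetical sketch, not the elastiknn implementation.
object ConcatenatedHashSketch {
  // Writes the table index and the k hash values as 4-byte ints
  // (ByteBuffer is big-endian by default).
  // Assumption: the table index comes first; the real layout may differ.
  def concat(tableIndex: Int, hashes: Array[Int]): Array[Byte] = {
    val buf = ByteBuffer.allocate((hashes.length + 1) * 4)
    buf.putInt(tableIndex)
    hashes.foreach(h => buf.putInt(h))
    buf.array()
  }
}
```

With k = 4, `ConcatenatedHashSketch.concat(1, Array(1, 22, 33, 44)).length` is (4 + 1) * 4 = 20 bytes per indexed hash.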

A more space-efficient implementation would use the MurmurHash algorithm to hash these ints down to a single 32-bit value.

Here's an example of how this could work:

package com.klibisz.elastiknn.testing

import java.nio.ByteBuffer
import org.apache.lucene.codecs.bloom.MurmurHash2

object MurmurHashExample extends App {
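  // Packs the ints into a byte array (4 bytes each) and hashes every byte of it.
  def hashInts(ints: Array[Int]): Int = {
    val buf = ByteBuffer.allocate(ints.length * 4)
    ints.foreach(buf.putInt)
    // Arguments are (data, seed, offset, length). Use the buffer's full length
    // instead of a hard-coded 16 so that inputs other than 4 ints hash correctly.
    MurmurHash2.hash(buf.array(), 0, 0, buf.capacity())
  }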
  println(hashInts(Array(1, 22, 33, 44))) // -774272470
  println(hashInts(Array(1, 22, 44, 33))) // -668602715
}

A more optimized solution would entirely avoid constructing the array and instead compute the amplified hash value incrementally as each smaller hash is computed.
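One way to sketch the incremental approach is with the mixing functions in `scala.util.hashing.MurmurHash3` from the Scala standard library. Note that this swaps in MurmurHash3 for Lucene's MurmurHash2, so it will not reproduce the values printed in the example above. The names `IncrementalHashSketch`, `Builder`, and `batch` are hypothetical, and mixing the table index in first is an assumption.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch, not the elastiknn implementation.
object IncrementalHashSketch {

  // Folds each small hash into a running 32-bit state,
  // so no intermediate array or byte buffer is allocated.
  final class Builder(tableIndex: Int, seed: Int = 0) {
    // Assumption: the table index is mixed in first, so the same
    // hash values in different tables give different results.
    private var state: Int = MurmurHash3.mix(seed, tableIndex)
    private var count: Int = 1

    // Call once per hash function, as soon as its value is computed.
    def add(hash: Int): Unit = {
      state = MurmurHash3.mix(state, hash)
      count += 1
    }

    // Final avalanche step; `count` is the number of ints mixed in so far.
    def result: Int = MurmurHash3.finalizeHash(state, count)
  }

  // Batch reference: same result as calling add() on each element in order.
  def batch(tableIndex: Int, hashes: Array[Int], seed: Int = 0): Int = {
    val mixed = hashes.foldLeft(MurmurHash3.mix(seed, tableIndex)) {
      (h, x) => MurmurHash3.mix(h, x)
    }
    MurmurHash3.finalizeHash(mixed, hashes.length + 1)
  }
}
```

Calling `add` once per hash and then `result` yields a 4-byte value regardless of k, at the cost of possible collisions between distinct hash tuples.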

Results from LocalBenchmarks.scala using current master branch:

| dataset | similarity | algorithm | mapping | query | k | shards | replicas | parallelQueries | esNodes | esCoresPerNode | esMemoryGb | warmupQueries | minWarmupRounds | maxWarmupRounds | recall | queriesPerSecond | durationMillis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":1000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.53446233 | 129.87013 | 77109 |
| AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":2000,"field":"vec","model":"lsh","probes":3,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.856039 | 104.166664 | 96213 |
| AnnbSift | "l2" | LSH | {"elastiknn":{"L":100,"dims":128,"k":4,"model":"lsh","similarity":"l2","w":2},"type":"elastiknn_dense_float_vector"} | {"candidates":5000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.82321775 | 38.61004 | 259521 |

The fashion-mnist index was 223mb. The sift index was 1gb.

health status index                                                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   0dd2468a4862c622c12a35b18e39ff4f0d2544e29fd1ae7bc3a6d2a8f04f09fc Q3REM5IGQrWXoxu_tfn1zA   1   0    1000000            0        1gb            1gb
green  open   6c0d3f5e225a3bcf28428bac7dab256716dd10a6bade2bae83ba64e93f07e925 smuDjogDT1WmHGAXyLgdvQ   1   0      60000            0      223mb          223mb

Results after taking a first pass at using MurmurHash2 from the Lucene library on this branch:

| dataset | similarity | algorithm | mapping | query | k | shards | replicas | parallelQueries | esNodes | esCoresPerNode | esMemoryGb | warmupQueries | minWarmupRounds | maxWarmupRounds | recall | queriesPerSecond | durationMillis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":1000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.53446233 | 125.0 | 80415 |
| AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":2000,"field":"vec","model":"lsh","probes":3,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.856039 | 92.59259 | 108660 |
| AnnbSift | "l2" | LSH | {"elastiknn":{"L":100,"dims":128,"k":4,"model":"lsh","similarity":"l2","w":2},"type":"elastiknn_dense_float_vector"} | {"candidates":5000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.8232326 | 27.62431 | 362320 |

(Should probably re-run this. Laptop really heated up here so CPU might've been throttled.)

health status index                                                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   0dd2468a4862c622c12a35b18e39ff4f0d2544e29fd1ae7bc3a6d2a8f04f09fc gLAnMww5RV-r8KAlVTTblQ   1   0    1000000            0        1gb            1gb
green  open   6c0d3f5e225a3bcf28428bac7dab256716dd10a6bade2bae83ba64e93f07e925 yc_IoSYMTcOtJWnrDrMwRg   1   0      60000            0    222.8mb        222.8mb