Try using 32-bit MurmurHash instead of concatenated integers for LSH hashes
alexklibisz opened this issue · comments
Several of the LSH implementations currently concatenate `k` hash values into an amplified hash value that is indexed and used for retrieval. This amplified hash value has `(k + 1) * 4` bytes: the `k` integer hash values and the hash table index (1 to `L`).
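To make the size arithmetic concrete, here is a minimal sketch of what a concatenated amplified hash could look like. The `concatenated` helper and its byte layout (table index first, then the k hash values) are assumptions for illustration, not necessarily the encoding the project uses.

```scala
import java.nio.ByteBuffer

object ConcatenatedHashSize {
  // Hypothetical layout: the hash table index followed by the k integer hash values.
  def concatenated(tableIndex: Int, hashes: Array[Int]): Array[Byte] = {
    val buf = ByteBuffer.allocate((hashes.length + 1) * 4)
    buf.putInt(tableIndex)
    hashes.foreach(h => buf.putInt(h))
    buf.array()
  }

  def main(args: Array[String]): Unit = {
    val hashes = Array(1, 22, 33, 44) // k = 4
    val bytes = concatenated(tableIndex = 1, hashes = hashes)
    println(bytes.length)  // (k + 1) * 4 = 20 bytes per amplified hash
    println(Integer.BYTES) // 4 bytes for a single 32-bit hash
  }
}
```

With k = 4, each indexed term shrinks from 20 bytes to 4 bytes if it is replaced by one 32-bit hash.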
A more space-efficient implementation would use the MurmurHash algorithm to hash the ints to a single 32-bit value.
Here's an example of how this could work:
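```scala
package com.klibisz.elastiknn.testing

import java.nio.ByteBuffer
import org.apache.lucene.codecs.bloom.MurmurHash2

object MurmurHashExample extends App {

  // Pack the ints into a byte array and hash all of its bytes to a single 32-bit value.
  def hashInts(ints: Array[Int]): Int = {
    val buf = ByteBuffer.allocate(ints.length * 4)
    ints.foreach(buf.putInt)
    // Arguments: data, seed, offset, length. Use the buffer's real length instead of a hard-coded 16.
    MurmurHash2.hash(buf.array(), 0, 0, buf.capacity())
  }

  println(hashInts(Array(1, 22, 33, 44))) // -774272470
  println(hashInts(Array(1, 22, 44, 33))) // -668602715
}
```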
A more optimized solution would entirely avoid constructing the array and instead compute the amplified hash value incrementally as each smaller hash is computed.
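One way to sketch the incremental idea is with the streaming primitives in the Scala standard library's `scala.util.hashing.MurmurHash3` (`mix` and `finalizeHash`). Note that this swaps in MurmurHash3 for Lucene's MurmurHash2, so its values will not match the example above. The `IncrementalHash` class, the `hashInts` reference function, and the default seed of 0 are hypothetical names and choices for illustration only.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical accumulator: folds each small hash into a running 32-bit state,
// so no intermediate array or ByteBuffer is allocated.
final class IncrementalHash(seed: Int = 0) {
  private var state: Int = seed
  private var count: Int = 0

  // Mix one more integer hash value into the running state.
  def add(h: Int): this.type = {
    state = MurmurHash3.mix(state, h)
    count += 1
    this
  }

  // Finalize using the number of values mixed in so far.
  def result: Int = MurmurHash3.finalizeHash(state, count)
}

object IncrementalHashExample {
  // Reference version over a materialized array, for comparison only.
  def hashInts(ints: Array[Int], seed: Int = 0): Int =
    MurmurHash3.finalizeHash(ints.foldLeft(seed)((s, h) => MurmurHash3.mix(s, h)), ints.length)

  def main(args: Array[String]): Unit = {
    val acc = new IncrementalHash()
    // In an LSH model, each add would happen right after one hash function is evaluated.
    Array(1, 22, 33, 44).foreach(acc.add)
    println(acc.result == hashInts(Array(1, 22, 33, 44))) // true
  }
}
```

Under the same assumptions, the hash table index could be passed as the seed or mixed in as the first value so that identical hash values in different tables still produce different terms.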
Results from LocalBenchmarks.scala using the current master branch:
dataset | similarity | algorithm | mapping | query | k | shards | replicas | parallelQueries | esNodes | esCoresPerNode | esMemoryGb | warmupQueries | minWarmupRounds | maxWarmupRounds | recall | queriesPerSecond | durationMillis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":1000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.53446233 | 129.87013 | 77109 |
AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":2000,"field":"vec","model":"lsh","probes":3,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.856039 | 104.166664 | 96213 |
AnnbSift | "l2" | LSH | {"elastiknn":{"L":100,"dims":128,"k":4,"model":"lsh","similarity":"l2","w":2},"type":"elastiknn_dense_float_vector"} | {"candidates":5000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.82321775 | 38.61004 | 259521 |
The fashion-mnist index was 223mb. The sift index was 1gb.
```
health status index                                                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   0dd2468a4862c622c12a35b18e39ff4f0d2544e29fd1ae7bc3a6d2a8f04f09fc Q3REM5IGQrWXoxu_tfn1zA   1   0    1000000            0        1gb            1gb
green  open   6c0d3f5e225a3bcf28428bac7dab256716dd10a6bade2bae83ba64e93f07e925 smuDjogDT1WmHGAXyLgdvQ   1   0      60000            0      223mb          223mb
```
Results after a first pass at using MurmurHash2 from the Lucene library on this branch:
dataset | similarity | algorithm | mapping | query | k | shards | replicas | parallelQueries | esNodes | esCoresPerNode | esMemoryGb | warmupQueries | minWarmupRounds | maxWarmupRounds | recall | queriesPerSecond | durationMillis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":1000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.53446233 | 125.0 | 80415 |
AnnbFashionMnist | "l2" | LSH | {"elastiknn":{"L":75,"dims":784,"k":4,"model":"lsh","similarity":"l2","w":7},"type":"elastiknn_dense_float_vector"} | {"candidates":2000,"field":"vec","model":"lsh","probes":3,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.856039 | 92.59259 | 108660 |
AnnbSift | "l2" | LSH | {"elastiknn":{"L":100,"dims":128,"k":4,"model":"lsh","similarity":"l2","w":2},"type":"elastiknn_dense_float_vector"} | {"candidates":5000,"field":"vec","model":"lsh","probes":0,"similarity":"l2","vec":{}} | 100 | 1 | 0 | 1 | 1 | 1 | 4 | 200 | 10 | 10 | 0.8232326 | 27.62431 | 362320 |
(I should probably re-run this. The laptop heated up noticeably during this run, so the CPU might have been throttled.)
```
health status index                                                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   0dd2468a4862c622c12a35b18e39ff4f0d2544e29fd1ae7bc3a6d2a8f04f09fc gLAnMww5RV-r8KAlVTTblQ   1   0    1000000            0        1gb            1gb
green  open   6c0d3f5e225a3bcf28428bac7dab256716dd10a6bade2bae83ba64e93f07e925 yc_IoSYMTcOtJWnrDrMwRg   1   0      60000            0    222.8mb        222.8mb
```