ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

Home Page: https://ekzhu.github.io/datasketch

Repository from Github https://github.com/ekzhu/datasketch

Obtain hashvalues by key from MinHashLSH

pavelnemirovsky opened this issue · comments

Guys, I am desperately looking for a way to obtain the hash values, as an array of ints, for a given key. Any direction? Thanks in advance,
P

MinHashLSH is designed for looking up keys given hash values (i.e., a MinHash), but it does not natively support the reverse lookup. I think a simple dictionary mapping key -> hash values would be helpful.
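
A minimal sketch of that dictionary idea, assuming the index exposes `insert(key, minhash)` and `query(minhash)` and the MinHash object exposes a `hashvalues` array, as datasketch's `MinHashLSH` and `MinHash` do. The `KeyedLSH` wrapper and its method names are hypothetical and not part of datasketch.

```python
class KeyedLSH:
    """Hypothetical wrapper: keeps a key -> hash values dictionary next to an LSH index."""

    def __init__(self, lsh):
        self.lsh = lsh            # e.g. a datasketch.MinHashLSH instance
        self._hashvalues = {}     # key -> list of int hash values

    def insert(self, key, minhash):
        # Insert into the index first, so a rejected key (e.g. a duplicate)
        # does not leave a stale entry in the side dictionary.
        self.lsh.insert(key, minhash)
        self._hashvalues[key] = [int(v) for v in minhash.hashvalues]

    def query(self, minhash):
        # Forward lookups are delegated to the wrapped index unchanged.
        return self.lsh.query(minhash)

    def hashvalues(self, key):
        """Return the full list of hash values recorded for `key`."""
        return self._hashvalues[key]
```

With datasketch installed this could be used as `KeyedLSH(MinHashLSH(threshold=0.5, num_perm=128))`. Note that the side dictionary in this sketch lives in process memory only; it is not written to whatever storage backend the index itself uses.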

@ekzhu can you give an example of that code? I spent some time on it and couldn't work out how to do it the right way. I appreciate your help.

@ekzhu I wrote code that recovers the original min-hashes of a document from MinHashLSH (Cassandra storage), but I found a frustrating issue (maybe expected): not all items from the array of min-hash permutations are stored in the LSH index, regardless of whether Cassandra storage is used or not.

The results look as follows:

Index With Num of perm: 128, Bands: 11, Items in Band: 11
5.46875 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Num of perm: 120, Bands: 12, Items in Band: 10
0.0 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Num of perm: 64, Bands: 7, Items in Band: 9
1.5625 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Num of perm: 32, Bands: 4, Items in Band: 8
0.0 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Num of perm: 16, Bands: 2, Items in Band: 6
25.0 % of min-hashes will be lost / won't be stored in LSH storage

The script used for testing the above behavior is here

The root cause lies in the _optimal_param function of the MinHashLSH class.

Thanks, and sorry if this is the expected behavior of the LSH implementation (I didn't have a chance to dive deep into it).

@ekzhu your advice is highly appreciated, PING

Thanks for the interesting benchmark. Yes, your observation is correct. LSH asks for a fixed band size. So, if the optimizer returns a band size that doesn't divide the number of permutation functions evenly, the leftover minimum hash values are not stored in the index and are lost.
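
The arithmetic behind the benchmark can be sketched as follows. The assumption here, not confirmed in this thread, is that the bands are contiguous slices of the hash value array starting at index 0, so the values left over at the end belong to no band. The helper names `band_ranges` and `lost_percent` are made up for illustration.

```python
def band_ranges(b, r):
    """Index ranges (start, end) of the hash values used by each of the b bands."""
    return [(i * r, (i + 1) * r) for i in range(b)]


def lost_percent(num_perm, b, r):
    """Percentage of the num_perm hash values that fall outside every band."""
    used = b * r
    if used > num_perm:
        raise ValueError("b * r must not exceed num_perm")
    return 100.0 * (num_perm - used) / num_perm


# (num_perm, bands, items per band) taken from the benchmark above
for num_perm, b, r in [(128, 11, 11), (120, 12, 10), (64, 7, 9), (32, 4, 8), (16, 2, 6)]:
    print(num_perm, b, r, lost_percent(num_perm, b, r))
```

For example, 11 bands of 11 values cover 121 of 128 hash values, so 7 / 128 = 5.46875 % are unused, which matches the first row of the benchmark.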

If we were to constrain the optimization space to only band sizes that are integer divisors of num_perm, then we would have a less accurate index on average.
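
To see how small the constrained search space is, here is a short sketch that enumerates the (bands, band size) pairs that use every hash value exactly once. The function name `exact_cover_params` is hypothetical.

```python
def exact_cover_params(num_perm):
    """All (bands, band_size) pairs with bands * band_size == num_perm."""
    return [(num_perm // r, r) for r in range(1, num_perm + 1) if num_perm % r == 0]


def all_params(num_perm):
    """All (bands, band_size) pairs with bands * band_size <= num_perm."""
    return [(b, r) for b in range(1, num_perm + 1) for r in range(1, num_perm // b + 1)]


print(exact_cover_params(128))
print(len(exact_cover_params(128)), "exact pairs out of", len(all_params(128)))
```

For num_perm = 128 only eight pairs divide evenly, all of them powers of two, so the optimizer would have far fewer shapes to pick from when matching a given threshold.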

@ekzhu understood. So if I use 8 bands with 16 items each, the index will be a little less accurate in this case, right?

If your num_perm = 128 and you use 8 x 16, you will be using all the hash values, but your accuracy may not be better than with 12 x 10. This depends on the threshold you use and the type of data you have. Since we cannot predict what data you are going to put in the index, the best-effort optimization is performed using the threshold only.
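
One way to compare the two shapes is the standard banding formula: with b bands of r hash values each, two sets with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula for the two configurations mentioned above; the function name `candidate_probability` is chosen for illustration.

```python
def candidate_probability(s, b, r):
    """Probability that two sets with Jaccard similarity s collide
    in at least one of b bands of r hash values each."""
    return 1.0 - (1.0 - s ** r) ** b


for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    p_full = candidate_probability(s, 8, 16)      # uses all 128 hash values
    p_partial = candidate_probability(s, 12, 10)  # uses 120 of 128 hash values
    print(f"s={s:.1f}  8x16: {p_full:.3f}  12x10: {p_partial:.3f}")
```

The steep part of each curve sits near (1/b)^(1/r), roughly 0.88 for 8 x 16 and roughly 0.78 for 12 x 10, so which shape is more accurate depends on where the chosen threshold falls, not on how many hash values are used.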

@ekzhu understood, thx
but it would be nice if _optimal_param picked values that allow the min-hash values to be restored on a per-key basis, don't you think?

A similar discussion regarding _optimal_param: #200

@pavelnemirovsky True, maybe one option is to refactor the hyper-parameter optimization out of MinHashLSH so users can choose which objective function they would like to use.