Distributed MinHashLSH
ctrajan opened this issue
Is there a distributed implementation of MinHashLSH? If not, is it possible to shard the base dataset across several machines?
For example, suppose my dataset has 10 billion items, which cannot fit in memory. Can I shard the dataset across 10 machines, with one MinHashLSH index per machine, each holding a different 1 billion items? When a query arrives, it is sent to all 10 machines and the results are gathered. Is the gathered result the same as searching a single MinHashLSH index that contains all 10 billion items?
It is a good idea to partition the data across multiple machines and build a separate index on each. In the example you gave, querying all 10 machines is going to give a better result than looking at only one machine, provided you combine the results and rank them by estimated or exact Jaccard similarity.
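The partition, query, and merge pattern described above can be sketched roughly as follows. This is a toy stand-in written with only the Python standard library, not the library's MinHashLSH: `ToyLSH`, `signature`, `estimated_jaccard`, `query_shards`, `NUM_PERM`, and `BANDS` are all hypothetical names and values chosen for illustration, and the "machines" are plain in-process objects. It assumes every shard uses the same hash functions and banding parameters.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 32              # signature length (illustrative value)
BANDS = 8                  # number of LSH bands (illustrative value)
ROWS = NUM_PERM // BANDS   # rows per band


def signature(tokens):
    """MinHash signature: per seeded hash, keep the minimum over a non-empty token set."""
    return tuple(
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        )
        for seed in range(NUM_PERM)
    )


def estimated_jaccard(a, b):
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


class ToyLSH:
    """Hypothetical stand-in for one shard's index (banding over signatures)."""

    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]
        self.sigs = {}

    def insert(self, key, sig):
        self.sigs[key] = sig
        for b, table in enumerate(self.tables):
            table[sig[b * ROWS:(b + 1) * ROWS]].add(key)

    def query(self, sig):
        found = set()
        for b, table in enumerate(self.tables):
            found |= table.get(sig[b * ROWS:(b + 1) * ROWS], set())
        return found


def query_shards(shards, sig, top_k=10):
    """Query every shard, then merge and rank by estimated Jaccard similarity."""
    hits = []
    for shard in shards:
        for key in shard.query(sig):
            hits.append((estimated_jaccard(sig, shard.sigs[key]), key))
    hits.sort(key=lambda h: (-h[0], h[1]))  # best score first, ties broken by key
    return hits[:top_k]


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over a lazy dog",
    "c": "completely unrelated sentence about databases",
    "d": "the quick brown fox jumps over the lazy cat",
}

single = ToyLSH()                    # one index holding everything
shards = [ToyLSH(), ToyLSH()]        # the same data split across two "machines"
for i, (key, text) in enumerate(sorted(docs.items())):
    sig = signature(set(text.split()))
    single.insert(key, sig)
    shards[i % len(shards)].insert(key, sig)

q = signature(set(docs["a"].split()))
merged = query_shards(shards, q)
```

In this toy model the keys in `merged` match `single.query(q)`, because the buckets an item lands in depend only on its own signature and not on which other items share the index. That equivalence relies on the assumption that all shards are configured identically; the ranking step is what turns the gathered candidates into a single ordered answer.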
@ekzhu would this be better than, say, using the Redis storage layer connected to a Redis cluster? Or is the latter just not a good idea at all?