Distributed MinHashLSH

Question

ctrajan opened this issue 3 years ago · comments

Is there a distributed implementation of MinHashLSH? If not, is it possible to shard the base dataset across several machines?

For example, if my dataset has 10 billion items and cannot fit in memory, can I shard it across 10 machines, with one MinHashLSH index per machine and 1 billion distinct items in each index? When a query comes in, it would search all 10 machines and gather the results. Would the gathered result be the same as searching ONE MinHashLSH index that contains all 10 billion items?

Eric Zhu · Answer 1 · Sat Mar 11 2023 13:31:54 GMT+0800 (China Standard Time)

It is a good idea to partition the data across multiple machines and build a separate index on each. In the example you gave, the result from 10 machines is going to be better than looking at only one machine, provided you combine the results and rank them by estimated or exact Jaccard similarity.
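The shard, query, and merge steps described above can be sketched in plain Python. This is a minimal illustration, not the datasketch implementation: `ToyLSH`, `signature`, `shard_of`, and `query_shards` are hypothetical names, and the values of `NUM_PERM` and `BANDS` are assumptions chosen to keep the example small. It also assumes every shard uses the same hash functions and banding parameters, so the union of the per-shard candidates can be compared with the candidates from one index built over the same data.

```python
import hashlib
import zlib
from collections import defaultdict

NUM_PERM = 32             # signature length (assumed value)
BANDS = 8                 # number of LSH bands (assumed value)
ROWS = NUM_PERM // BANDS  # rows per band


def signature(tokens):
    """MinHash signature: the minimum of each seeded hash over the token set."""
    return tuple(
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        )
        for seed in range(NUM_PERM)
    )


def estimate_jaccard(a, b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / NUM_PERM


class ToyLSH:
    """Toy banded LSH index; a stand-in for a real MinHashLSH index."""

    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]
        self.sigs = {}

    def insert(self, key, sig):
        self.sigs[key] = sig
        for b, table in enumerate(self.tables):
            table[sig[b * ROWS:(b + 1) * ROWS]].add(key)

    def query(self, sig):
        """Return keys that share at least one band with the query signature."""
        found = set()
        for b, table in enumerate(self.tables):
            found |= table.get(sig[b * ROWS:(b + 1) * ROWS], set())
        return found


def shard_of(key, n_shards):
    """Stable key-to-shard assignment (hypothetical routing rule)."""
    return zlib.crc32(key.encode()) % n_shards


def query_shards(shards, sig, top_k=10):
    """Query every shard, then rank the merged candidates by estimated Jaccard."""
    scored = []
    for shard in shards:
        for key in shard.query(sig):
            scored.append((estimate_jaccard(sig, shard.sigs[key]), key))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely unrelated words appear in this sentence",
    "d": "the quick brown fox jumps over the lazy dog",
}

single = ToyLSH()                      # one index holding everything
shards = [ToyLSH() for _ in range(3)]  # the same data split three ways
for key, text in docs.items():
    sig = signature(set(text.split()))
    single.insert(key, sig)
    shards[shard_of(key, len(shards))].insert(key, sig)

query = signature(set("the quick brown fox jumps over the lazy dog".split()))
merged = query_shards(shards, query)
```

In this toy model each key lands in band buckets independently of the other keys, so the set of keys in `merged` matches `single.query(query)` as long as `top_k` does not truncate it. In a real deployment each shard would run on its own machine and the merge step would run on the client.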

Michael Joseph Rosenthal · Answer 2 · Wed Mar 15 2023 05:28:03 GMT+0800 (China Standard Time)

@ekzhu would this be better than, say, using the redis storage layer connected to a redis cluster? Or is the latter just not a good idea at all?

Eric Zhu · Answer 3 · Fri Mar 17 2023 17:41:50 GMT+0800 (China Standard Time)

I have not tried Redis Cluster. I think this depends on how the index is sharded in Redis. But I guess doing the sharding in code on the client side is probably easier.
