ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

Home Page: https://ekzhu.github.io/datasketch

[Question] fast search

vince62s opened this issue

Hello,
I am just getting started with this API. My use case is as follows:

  1. In a large file of 100 million lines, I would like to remove every line whose Jaccard similarity with another line is > 0.7 (for instance).
    I looped once with MinHash.bulk to store the hashes,
    then used a double loop to compare them line by line => very slow.
  2. Same question, but with the lines of File1 compared against the lines of File2.

Is there a faster way to accomplish this?

Thanks

Sorry, this seems to be the same as #188.
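
For readers landing here: the double loop is quadratic in the number of lines, and the usual way around it is locality-sensitive hashing, which this library lists among its features (LSH). The sketch below is not datasketch code; it is a stdlib-only illustration of the banding idea, so that the mechanism is visible. The names `shingles`, `minhash`, `estimated_jaccard` and `deduplicate`, the whitespace tokenization, and the constants `NUM_PERM` and `BANDS` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64              # signature length (assumed value)
BANDS = 16                 # number of bands (assumed value)
ROWS = NUM_PERM // BANDS   # signature rows per band


def shingles(line):
    """Token set of a line; whitespace-separated words are an assumption."""
    return set(line.split())


def minhash(tokens):
    """MinHash signature: per seed, the smallest 64-bit hash over all tokens."""
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
             for t in tokens),
            default=0,  # empty lines all share the all-zero signature
        ))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(lines, threshold=0.7):
    """Keep a line only if no previously kept line looks more similar than threshold."""
    buckets = defaultdict(list)  # (band index, band values) -> indices of kept lines
    signatures = {}              # index of kept line -> its signature
    kept = []
    for idx, line in enumerate(lines):
        sig = minhash(shingles(line))
        band_keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        # Only lines sharing at least one whole band are compared at all.
        candidates = {c for key in band_keys for c in buckets.get(key, ())}
        if any(estimated_jaccard(sig, signatures[c]) > threshold for c in candidates):
            continue  # near-duplicate of a line that was already kept
        kept.append(idx)
        signatures[idx] = sig
        for key in band_keys:
            buckets[key].append(idx)
    return [lines[i] for i in kept]
```

Each line is compared only against the lines that share a bucket with it, so the work per line depends on bucket sizes rather than on the size of the file. A pair with true similarity s becomes a candidate with probability 1 - (1 - s^ROWS)^BANDS, so the choice of bands and rows trades missed duplicates against extra comparisons. For the File1 versus File2 case, the same structure applies: build the buckets from File1 only, then look up each line of File2 without inserting it.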