The dedup code does not utilize LSH for calculating similarity?

Question

SefaZeng opened this issue 4 months ago

The dedup section of the README says:

In this step, we build a MinHashLSH index and query it to locate near duplicates Chapter 3, Mining of Massive Datasets. We are using Jaccard similarity threshold of 0.8 to determine whether a pair of documents should be considered as a duplicate. Our implementation is using --range and --bands arguments that can be calculated with datasketch/lsh.py given a Jaccard threshold. We find aggressive deduplication the most efficient, but you can change the parameters below in order to reduce the amount of filtered content.

But the code in dedup/generate_duplicate_pairs.py only checks for identical hash values and never calculates a similarity. There is also no code that sets a Jaccard similarity threshold.

My understanding is that we should construct a MinHashLSH and insert all the hash values into it. Then, for each hash, we should use lsh.query(hash) to check if there are similar items.
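To make the flow I have in mind concrete, here is a minimal sketch using only the standard library. It is not the repository's code: `TinyLSH`, `minhash`, `NUM_PERM`, `BANDS`, and `ROWS` are hypothetical names and values chosen for illustration. The real `datasketch.MinHashLSH` class exposes similar `insert(key, minhash)` and `query(minhash)` methods. In this kind of index the threshold is not an explicit comparison; as far as I understand, it comes from the choice of bands and rows, roughly `(1 / BANDS) ** (1 / ROWS)`.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128  # assumed signature length (hypothetical value)
BANDS = 16      # hypothetical number of bands
ROWS = 8        # rows per band; BANDS * ROWS must equal NUM_PERM


def shingles(text, k=5):
    """Return the set of character k-grams of a document."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text):
    """Build a MinHash signature: one minimum per salted hash function."""
    grams = shingles(text)
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return sig


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def approx_threshold():
    """Similarity at which the banding scheme starts to report candidates."""
    return (1 / BANDS) ** (1 / ROWS)


class TinyLSH:
    """Toy banded LSH index: documents sharing any whole band are candidates."""

    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]

    def _bands(self, sig):
        for b in range(BANDS):
            yield b, tuple(sig[b * ROWS:(b + 1) * ROWS])

    def insert(self, key, sig):
        for b, band in self._bands(sig):
            self.tables[b][band].add(key)

    def query(self, sig):
        found = set()
        for b, band in self._bands(sig):
            found |= self.tables[b].get(band, set())
        return found


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "0123456789 9876543210 zzzzzzzz",
}
signatures = {key: minhash(text) for key, text in docs.items()}

index = TinyLSH()
for key, sig in signatures.items():
    index.insert(key, sig)

# Candidates for a document identical to "a"; "b" shares no shingles with it.
candidates = index.query(minhash(docs["a"]))
```

If this reading is right, then grouping documents by identical band hashes would already be the LSH step, and an explicit Jaccard comparison would only be needed as an optional verification pass over the candidates.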

I'm not very familiar with this part, so please correct me if my understanding is wrong.