mattilyra / LSH

Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents

Jaccard should be performed on sets, but appears to be given numpy arrays

thesamuel opened this issue

It appears that MinHash.jaccard is expecting two sets to be given here, where the & and | are used for set intersection and union, respectively:

return len(f_a & f_b) / len(f_a | f_b)

From what I understand, it's being passed two numpy arrays from Cache (since they're outputs of the fingerprint functions):

LSH/lsh/cache.py

Lines 65 to 66 in da67215

jaccard = self.hasher.jaccard(self.fingerprints[id1],
                              self.fingerprints[id2])

The code doesn't raise an exception, because & and | are overloaded for numpy arrays (as elementwise bitwise operators), but I'm concerned that this may not be computing the Jaccard similarity correctly.
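A quick sketch of the concern (the example fingerprints are made up for illustration): on integer numpy arrays, & and | are elementwise bitwise operators, so the intersection and union each have the same length as the inputs and the ratio degenerates, while casting to sets first gives the intended set-based Jaccard.

```python
import numpy as np

f_a = np.array([3, 1, 4, 1, 5])
f_b = np.array([3, 1, 4, 2, 6])

# Elementwise bitwise ops: len(f_a & f_b) == len(f_a | f_b) == len(f_a),
# so this is always 1.0 for two same-length arrays, regardless of content.
elementwise = len(f_a & f_b) / len(f_a | f_b)

# Casting to sets first gives the intended set intersection/union:
# {1, 3, 4} over {1, 2, 3, 4, 5, 6} -> 3 / 6 = 0.5
set_based = len(set(f_a) & set(f_b)) / len(set(f_a) | set(f_b))
```

This would also explain why no candidates get filtered: a similarity that is always 1.0 passes any threshold.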

From my testing, I found that this jaccard function did not work as expected (didn't filter any candidates).

I apologize if I'm not understanding this correctly, please correct me if I'm wrong!

Thanks for reporting this, I think you're right. There's special-case handling in MinHash.jaccard for strings, which does turn the fingerprint into a set, but numpy arrays are left untouched.

@thesamuel I've opened a branch fix/filter_duplicates to cast the document fingerprints to sets. You mentioned you had some tests for this, I'd be interested in what those tests say on the new branch.

Please also note that the jaccard function in the hasher does not compute the exact Jaccard similarity of two documents; it is a MinHash estimate, and if the number of hashes in the hasher is too low the estimate will be very unstable.
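To illustrate that instability, here is a minimal MinHash estimator written from scratch (this is an illustrative sketch, not the library's MinHasher; the salted-hash scheme and all values are assumptions): each hash function agrees on its minimum element with probability equal to the true Jaccard similarity, so the variance of the estimate shrinks only as the number of hashes grows.

```python
import random

def minhash_estimate(set_a, set_b, num_hashes, seed=0):
    # Estimate Jaccard similarity: the fraction of salted hash functions
    # under which both sets have the same minimum element.
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_hashes):
        salt = rng.getrandbits(32)
        h = lambda x: hash((salt, x))
        if min(set_a, key=h) == min(set_b, key=h):
            matches += 1
    return matches / num_hashes

a = set(range(100))
b = set(range(50, 150))   # exact Jaccard: 50 / 150 ~ 0.33

# With only 5 hashes the estimate can only take values 0.0, 0.2, ..., 1.0
# and jumps around from seed to seed; with 500 hashes it settles near 0.33.
few = [minhash_estimate(a, b, num_hashes=5, seed=s) for s in range(5)]
many = minhash_estimate(a, b, num_hashes=500)
```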

@mattilyra Thanks for taking a look at this, I think your fix looks good. I don't have any unit tests implemented, but I think it would be worth adding a test that checks whether a nonzero number of duplicate candidates are filtered when a high threshold is used.
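A sketch of what such a test could look like. Note that jaccard here mirrors the set-cast fix being discussed, and filter_candidates is a hypothetical stand-in for the cache-level filtering, not the library's actual API; the fingerprints are made up for illustration.

```python
import numpy as np

def jaccard(f_a, f_b):
    # Mirrors the proposed fix: cast fingerprints to sets before
    # taking the set intersection and union.
    a, b = set(f_a), set(f_b)
    return len(a & b) / len(a | b)

def filter_candidates(candidate_pairs, fingerprints, threshold):
    # Hypothetical stand-in for the cache-level filtering: keep only
    # pairs whose estimated similarity meets the threshold.
    return [(i, j) for i, j in candidate_pairs
            if jaccard(fingerprints[i], fingerprints[j]) >= threshold]

fingerprints = {
    1: np.array([1, 2, 3, 4]),   # near-duplicate of 2 (Jaccard 3/5 = 0.6)
    2: np.array([1, 2, 3, 9]),
    3: np.array([5, 6, 7, 8]),   # unrelated to 1 (Jaccard 0.0)
}

# With a high threshold, the unrelated pair should be filtered out.
kept = filter_candidates([(1, 2), (1, 3)], fingerprints, threshold=0.5)
```

The key assertion is simply that fewer pairs come out than went in, which the buggy elementwise version could never satisfy.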