Jaccard should be performed on sets, but appears to be given numpy arrays
thesamuel opened this issue
It appears that MinHash.jaccard is expecting two sets to be given here, where the & and | operators are used for set intersection and union, respectively:
Line 76 in da67215
From what I understand, it's being passed two numpy arrays from Cache (since they're outputs of the fingerprint functions):
Lines 65 to 66 in da67215
The code doesn't raise an exception, because & and | are overloaded for numpy arrays, but I'm concerned that this may not be computing the Jaccard similarity correctly.
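To illustrate the concern, here is a minimal sketch of how & and | behave on numpy arrays versus Python sets. The fingerprint values are made up, and the ratio len(a & b) / len(a | b) is only an assumption about how the similarity is computed, since the referenced line is not reproduced here.

```python
import numpy as np

# Two hypothetical fingerprints of equal length (values are made up).
a = np.array([1, 2, 3, 4])
b = np.array([3, 4, 5, 6])

# On numpy arrays, & and | are element-wise bitwise AND / OR, so both
# results keep the length of the inputs and the ratio is always 1.0.
bitwise_ratio = len(a & b) / len(a | b)

# On sets, & and | are intersection and union, which is what the
# Jaccard similarity needs.
set_a, set_b = set(a.tolist()), set(b.tolist())
set_ratio = len(set_a & set_b) / len(set_a | set_b)

print(bitwise_ratio)  # 1.0
print(set_ratio)      # 2 shared values out of 6 distinct values
```

Under that assumption, two equal-length arrays would always score 1.0 regardless of their contents.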
From my testing, I found that this jaccard function did not work as expected (it didn't filter any candidates).
I apologize if I'm not understanding this correctly; please correct me if I'm wrong!
Thanks for reporting this, I think you're right. There's a special case for strings passed to MinHash.jaccard that does turn the fingerprint into a set, but a numpy array is left untouched.
@thesamuel I've opened a branch, fix/filter_duplicates, to cast the document fingerprints to sets. You mentioned you had some tests for this; I'd be interested in what those tests say on the new branch.
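For readers following along, casting fingerprints to sets before comparing them could look roughly like the sketch below. The function name jaccard_as_sets is hypothetical, and this is not the code on the fix/filter_duplicates branch, only an illustration of the idea.

```python
def jaccard_as_sets(fingerprint_a, fingerprint_b):
    """Jaccard similarity of two fingerprints, compared as sets.

    Hypothetical helper: accepts any iterable of hashable values
    (lists, tuples, 1-D numpy arrays) and casts both to sets first.
    """
    set_a, set_b = set(fingerprint_a), set(fingerprint_b)
    union = set_a | set_b
    if not union:
        # Two empty fingerprints: return 0.0 rather than divide by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


print(jaccard_as_sets([1, 2, 3, 4], [3, 4, 5, 6]))
```

Returning 0.0 for two empty fingerprints is a design choice made for this sketch; raising an error would be equally defensible.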
Please also note that the jaccard function in the hasher does not compute the exact Jaccard similarity of two documents. If the number of hashes in the hasher is too low, the jaccard function will be very unstable.
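The instability with few hashes can be seen in a toy experiment. The sketch below is a generic MinHash estimator built on a simple linear hash family, not this library's hasher; all names, the hash family, and the document contents are assumptions chosen for illustration.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # large prime modulus for the toy hash family


def minhash_signature(items, num_hashes, seed):
    # One (a, b) pair per hash function h(x) = (a * x + b) mod PRIME.
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(num_hashes)]
    # The signature keeps the minimum hash value under each function.
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(items_a, items_b, num_hashes, seed):
    # The same seed gives both documents the same hash functions.
    sig_a = minhash_signature(items_a, num_hashes, seed)
    sig_b = minhash_signature(items_b, num_hashes, seed)
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / num_hashes


def spread(items_a, items_b, num_hashes, trials=20):
    # Standard deviation of the estimate across independently seeded runs.
    estimates = [estimate_jaccard(items_a, items_b, num_hashes, seed)
                 for seed in range(trials)]
    return statistics.pstdev(estimates)


doc_a = set(range(0, 100))
doc_b = set(range(50, 150))  # exact Jaccard similarity is 50 / 150

spread_few = spread(doc_a, doc_b, num_hashes=8)
spread_many = spread(doc_a, doc_b, num_hashes=256)
print(spread_few, spread_many)
```

Each estimate is an average over num_hashes independent comparisons, so its spread shrinks roughly with the square root of the number of hashes.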
@mattilyra Thanks for taking a look at this. I think your fix looks good. I don't have any unit tests implemented, but I think it would be worth adding a test that checks whether a nonzero number of duplicate candidates are filtered when a high threshold is used.