1e0ng / simhash

A Python Implementation of Simhash Algorithm

Distance between two Simhashes is not the same as in other Simhash implementations.

NicolasAubry opened this issue

The condition at
https://github.com/leonsim/simhash/blob/5ea7411823d86c61555fe5f7f8b76b85f97cdc01/simhash/__init__.py#L94
should be:

    if v[i] > 0:

Other Simhash implementations don't set a bit in the result hash to 1 if the corresponding count in the result vector is 0. Examples: https://github.com/seomoz/simhash-cpp/blob/e7aacb1642f406ff0815cf402e909d2002473812/src/simhash.cpp
and https://github.com/admazely/simhash/blob/master/main.js
You can also check this paper from the University of Saskatchewan (page 3): https://www.cs.usask.ca/~croy/papers/2011/URKH_WCRE2011_simCad.pdf
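
To illustrate the difference, here is a minimal sketch of the final step that collapses the count vector into a fingerprint. The function name `fingerprint_from_counts` and the `strict` flag are hypothetical and are not part of this library; the non-strict branch assumes the linked line compared with `>=`, which is what the description above implies.

```python
def fingerprint_from_counts(v, strict=True):
    """Collapse a per-bit count vector into an integer fingerprint.

    strict=True  sets bit i only when v[i] > 0 (the convention described above).
    strict=False also sets bit i when v[i] == 0 (assumed earlier behaviour).
    """
    ans = 0
    for i, count in enumerate(v):
        bit_on = count > 0 if strict else count >= 0
        if bit_on:
            ans |= 1 << i
    return ans


# Hypothetical count vector with ties (zeros) at positions 1 and 3.
counts = [2, 0, -2, 0]
print(bin(fingerprint_from_counts(counts, strict=True)))   # 0b1
print(bin(fingerprint_from_counts(counts, strict=False)))  # 0b1011
```

The two conventions only disagree on positions where the count is exactly 0.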

commented

Hi, thanks for pointing this out. May I ask whether this is just a convention, or does it affect the result?

commented

Fixed.

I edited my last comment because I forgot to insert a link. To answer your question: the Simhash of two identical strings will be different if one is computed with your implementation and the other with another implementation. Thank you for the fix.
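
A rough, self-contained sketch of that effect follows. It is not this library's code: the helpers `token_hash`, `simhash`, and `hamming` are hypothetical, tokens are unweighted whitespace-separated words, and MD5 truncated to 64 bits is an arbitrary choice of feature hash.

```python
import hashlib


def token_hash(token, f=64):
    # Hypothetical feature hash: low f bits of the MD5 digest.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(tokens, f=64, strict=True):
    # Accumulate +1 / -1 per bit position over all token hashes.
    v = [0] * f
    for t in tokens:
        h = token_hash(t, f)
        for i in range(f):
            v[i] += 1 if h & (1 << i) else -1
    # Collapse counts to bits; the two conventions differ only when v[i] == 0.
    ans = 0
    for i in range(f):
        if (v[i] > 0) if strict else (v[i] >= 0):
            ans |= 1 << i
    return ans


def hamming(a, b):
    # Number of bit positions in which a and b differ.
    return bin(a ^ b).count("1")


tokens = "hello world".split()
a = simhash(tokens, strict=True)
b = simhash(tokens, strict=False)
print(hamming(a, b))  # non-zero: same input, different fingerprints
```

With an even number of unweighted tokens, ties are common, so the same string yields fingerprints that are some Hamming distance apart depending on the convention. With an odd number of unweighted tokens, no count can be 0 and both conventions agree.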

commented

Thanks for updating the link to that paper. 👍