Distance between two Simhashes is not the same as in other Simhash implementations.
NicolasAubry opened this issue · comments
if v[i] > 0:
Other Simhash implementations don't set the bit in the result hash to 1 if the count in the result vector is 0. Examples: https://github.com/seomoz/simhash-cpp/blob/e7aacb1642f406ff0815cf402e909d2002473812/src/simhash.cpp
and https://github.com/admazely/simhash/blob/master/main.js
You can also check this paper from the University of Saskatchewan (page 3): https://www.cs.usask.ca/~croy/papers/2011/URKH_WCRE2011_simCad.pdf
Hi, thanks for pointing this out. May I ask whether this is just a convention, or does it affect the result?
Fixed.
I edited my last comment because I forgot to insert a link. To answer your question: the simhash of two identical strings will be different if one is computed with your implementation and the other with another implementation. Thank you for the fix.
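To make the difference concrete, here is a minimal sketch. It is not the code from this repository or from the linked implementations. The names `simhash`, `tie_sets_bit`, and `hamming_distance`, as well as the 4-bit feature hashes, are made up for illustration. It assumes the only difference between the two conventions is how a count of exactly 0 is treated (`v[i] >= 0` versus `v[i] > 0`).

```python
def simhash(feature_hashes, bits=4, tie_sets_bit=False):
    """Build a fingerprint from already-hashed features (hypothetical helper).

    tie_sets_bit=False mimics `if v[i] > 0:`  (a zero count leaves the bit at 0).
    tie_sets_bit=True  mimics `if v[i] >= 0:` (a zero count sets the bit to 1).
    """
    v = [0] * bits
    for h in feature_hashes:
        for i in range(bits):
            # Add 1 when bit i of the feature hash is set, subtract 1 otherwise.
            v[i] += 1 if (h >> i) & 1 else -1

    fingerprint = 0
    for i in range(bits):
        if v[i] > 0 or (tie_sets_bit and v[i] == 0):
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Two made-up feature hashes for the *same* input.
# They disagree on bits 1 and 2, so v[1] == v[2] == 0.
features = [0b1100, 0b1010]

strict = simhash(features)                     # 0b1000
loose = simhash(features, tie_sets_bit=True)   # 0b1110

print(format(strict, "04b"), format(loose, "04b"))  # 1000 1110
print(hamming_distance(strict, loose))              # 2
```

So the same input ends up at a non-zero distance from itself when the two conventions are mixed, which is why fingerprints produced by different implementations cannot be compared unless they agree on the zero case.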
Thanks for updating the link to that paper. 👍