1e0ng / simhash

A Python Implementation of Simhash Algorithm

Distance between two Simhashes is not the same as in other Simhash implementations.

NicolasAubry opened this issue

The test that decides whether a bit is set in the resulting hash should be

    if v[i] > 0:

Other Simhash implementations don't set the bit in the resulting hash to 1 if the count in the result vector is 0. Examples: https://github.com/seomoz/simhash-cpp/blob/e7aacb1642f406ff0815cf402e909d2002473812/src/simhash.cpp
and https://github.com/admazely/simhash/blob/master/main.js
You can also check this paper from the University of Saskatchewan (page 3): https://www.cs.usask.ca/~croy/papers/2011/URKH_WCRE2011_simCad.pdf
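
To make the difference concrete, here is a minimal sketch of the final step only, where the per-bit counts are collapsed into a fingerprint. The function name `fingerprint_from_counts`, the `tie_sets_bit` flag and the sample counts are made up for illustration and are not taken from this library; the sketch assumes the current code behaves like a `>= 0` test.

```python
def fingerprint_from_counts(v, tie_sets_bit):
    """Collapse per-bit counts into an integer fingerprint.

    v[i] is the count for bit i (votes for 1 minus votes for 0).
    tie_sets_bit=True mimics a `>= 0` test (assumed current behaviour);
    tie_sets_bit=False mimics the proposed `> 0` test.
    """
    fingerprint = 0
    for i, count in enumerate(v):
        if count > 0 or (tie_sets_bit and count == 0):
            fingerprint |= 1 << i
    return fingerprint


counts = [3, 0, -1, 0]  # hypothetical counts; bits 1 and 3 are ties
strict = fingerprint_from_counts(counts, tie_sets_bit=False)
lenient = fingerprint_from_counts(counts, tie_sets_bit=True)
print(format(strict, "04b"), format(lenient, "04b"))  # 0001 1011
```

The two results differ only in the bits whose count is exactly 0.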


Hi, thanks for pointing this out. May I ask whether this is just a convention, or does it affect the result?



I edited my last comment because I forgot to insert a link. To answer your question: the simhashes of two identical strings will be different if one is computed with your implementation and the other with another implementation. Thank you for the fix.
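
As a rough illustration of that mismatch, the sketch below runs the same input through both conventions and measures the Hamming distance between the results. The 4-bit width and the two feature hashes are invented so the arithmetic can be followed by hand, and `simhash` here is a toy function rather than this library's API; a real implementation would hash tokens to a wider value.

```python
def simhash(feature_hashes, bits, tie_sets_bit):
    """Toy Simhash over pre-computed feature hashes, each with weight 1."""
    counts = [0] * bits
    for h in feature_hashes:
        for i in range(bits):
            # A 1 bit votes +1, a 0 bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0 or (tie_sets_bit and count == 0):
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Hypothetical hashes of the two features of one and the same document.
feature_hashes = [0b1100, 0b1010]

with_ties = simhash(feature_hashes, 4, tie_sets_bit=True)      # 0b1110
without_ties = simhash(feature_hashes, 4, tie_sets_bit=False)  # 0b1000
distance = hamming_distance(with_ties, without_ties)           # 2
print(format(with_ties, "04b"), format(without_ties, "04b"), distance)
```

The distance is nonzero only because bits 1 and 2 end up with a count of exactly 0. With equal weights that can only happen for an even number of features, so whether two implementations disagree depends on the input.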


Thanks for updating the link to that paper. 👍