1e0ng / simhash

A Python Implementation of Simhash Algorithm

Different lists of tokens with different weights generate the same value?

jdhao opened this issue

commented

I am using this package to generate simhashes for my text. I have found that different lists of tokens may generate the same hash value, which confuses me.

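from simhash import Simhash

# Weighted features as (token, weight) pairs.
# Tokens: 恋爱 (being in love), 脱单 (no longer single), 闪婚 (flash marriage), 结婚 (getting married)
x = [('恋爱', 2.61855744598), ('脱单', 0.29491400873), ('闪婚', 0.29491400873)]
y = [('恋爱', 3.92783616897), ('结婚', 3.282195995975)]

print(Simhash(x).value)  # hash of the first token list
print(Simhash(y).value)  # hash of the second token list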

Both print statements output the same hash value, 12369878657857125584. I am not sure why this happens. Is this the intended behavior?

commented

There is a small chance of a collision; it comes down to probability. One way to reduce the probability of collisions is to compute the simhash several times, each time using a different word segmentation.
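
To illustrate where such a collision can come from, here is a minimal sketch of a weighted simhash. It is not the library's actual code: `token_hash` and `weighted_simhash` are hypothetical names, and the 64-bit MD5-based token hash is an assumption. In the example from the question, the weight of '恋爱' is larger than the combined weight of the other tokens in each list, so under this scheme it decides every bit and both lists collapse to the same fingerprint.

```python
import hashlib


def token_hash(token, bits=64):
    """Hypothetical helper: low `bits` bits of the token's MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) & ((1 << bits) - 1)


def weighted_simhash(features, bits=64):
    """Each token votes +weight or -weight on every bit position."""
    votes = [0.0] * bits
    for token, weight in features:
        h = token_hash(token, bits)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its weighted vote is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


x = [('恋爱', 2.61855744598), ('脱单', 0.29491400873), ('闪婚', 0.29491400873)]
y = [('恋爱', 3.92783616897), ('结婚', 3.282195995975)]

# In both lists '恋爱' outweighs all other tokens combined,
# so it wins the vote on every bit position.
print(weighted_simhash(x) == weighted_simhash(y))
print(weighted_simhash(x) == token_hash('恋爱'))
```

Both comparisons hold in this sketch regardless of the hash function chosen, because the result depends only on the weights. If the library votes the same way, lowering the dominant weight (or normalizing weights so no single token exceeds the sum of the rest) would let the other tokens influence the fingerprint.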