Different lists of tokens with different weights generate the same value?
jdhao opened this issue
I am using this package to generate a simhash for my text. I have found that different lists of tokens may generate the same hash value, which is confusing to me.
from simhash import Simhash
x = [('恋爱', 2.61855744598), ('脱单', 0.29491400873), ('闪婚', 0.29491400873)]
y = [('恋爱', 3.92783616897), ('结婚', 3.282195995975)]
print(Simhash(x).value)
print(Simhash(y).value)
The last two print statements print the same hash value, 12369878657857125584. I am not sure why this happens. Is this the intended behavior?
There is a small chance of a collision; it comes down to probability. One way to reduce the probability of collisions is to run simhash several times, each on a different word segmentation of the text.
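For these two inputs in particular, a likely contributing factor is that a single heavily weighted token can decide every bit of the fingerprint. The sketch below is a minimal illustration, assuming the usual weighted-vote formulation of simhash (each token adds its weight to a bit position if its hash has that bit set and subtracts it otherwise). It is not this package's actual code: `token_hash` and `weighted_simhash` are hypothetical helpers, and the MD5-based 64-bit token hash is an assumption, so the values it produces need not match the one printed above.

```python
import hashlib


def token_hash(token, bits=64):
    # Hypothetical helper: keep the low `bits` bits of the token's MD5 digest.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def weighted_simhash(features, bits=64):
    # Hypothetical helper: weighted vote per bit position.
    totals = [0.0] * bits
    for token, weight in features:
        h = token_hash(token, bits)
        for i in range(bits):
            # Add the weight if bit i is set in the token hash, else subtract it.
            totals[i] += weight if (h >> i) & 1 else -weight
    # A fingerprint bit is set only where the weighted vote is positive.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


x = [('恋爱', 2.61855744598), ('脱单', 0.29491400873), ('闪婚', 0.29491400873)]
y = [('恋爱', 3.92783616897), ('结婚', 3.282195995975)]

print(weighted_simhash(x) == weighted_simhash(y))   # True
print(weighted_simhash(x) == token_hash('恋爱'))     # True
```

In both lists the weight of '恋爱' is larger than the combined weight of all other tokens, so under this formulation the sign of every bit total follows the hash of '恋爱' and both fingerprints reduce to that single token's hash. Inputs where no single token outweighs the rest would not collapse this way.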