1e0ng / simhash

A Python Implementation of Simhash Algorithm

Seemingly wrong result

xiaofen9 opened this issue

>>> a
u'\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u011d\u011e\u0114'
>>> b
u"!#$%$%&'()*+,\xa9\xaa\xab\xac\xad\xae\xaf\xae\xaf\xb0\xb1\xb2<=>?"
>>> Simhash(a).distance(Simhash(b))
0
commented

Thanks for reporting this. I would recommend using a meaningful word segmentation before running the simhash.
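
To illustrate what segmenting before hashing can look like, here is a minimal, self-contained sketch. It is not this library's implementation: the `simhash`, `hamming`, and `word_features` helpers are hypothetical names, MD5 is an arbitrary choice of feature hash, and the regex-based word splitting is only a stand-in for a real segmenter.

```python
import hashlib
import re


def simhash(features, bits=64):
    """Return a `bits`-wide fingerprint for an iterable of string features (sketch only)."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def word_features(text):
    # Assumed segmentation: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


doc1 = "How are you? I am fine. Thanks."
doc2 = "How are you? I am fine. Thanks a lot."
d = hamming(simhash(word_features(doc1)), simhash(word_features(doc2)))
print(d)
```

In this sketch, an input that produces no features at all hashes to 0, so any two such inputs end up at distance 0. Whether that is what happened in the report above is not confirmed in this thread.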

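from simhash import Simhash

# text1 contains a typo: 童学 instead of 同学 ("Thank you, classmates, for your support")
text1 = '谢谢童学们的支持'
text2 = '谢谢同学们的支持'
words1 = list(text1)  # split into single characters
words2 = list(text2)
print(Simhash(words1).distance(Simhash(words2)))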

The distance is 25. That seems wrong, doesn't it?

commented

Hi, maybe you can do some pre-processing to fix the typo? Also, you can create a new issue for a different question.
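
As a rough sketch of that kind of pre-processing: the `CORRECTIONS` table, `normalize`, and the small `simhash` helper below are all hypothetical and are not part of this library. A real correction table would come from a spell checker or a curated list.

```python
import hashlib

# Hypothetical correction table mapping a known typo to its intended form.
CORRECTIONS = {"童学": "同学"}


def normalize(text):
    """Apply every known correction before the text is split into features."""
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    return text


def simhash(features, bits=64):
    """Sketch of a simhash fingerprint; MD5 is an arbitrary feature hash."""
    counts = [0] * bits
    for feature in features:
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


text1 = "谢谢童学们的支持"
text2 = "谢谢同学们的支持"

# Per-character features, as in the question above.
before = hamming(simhash(list(text1)), simhash(list(text2)))
after = hamming(simhash(list(normalize(text1))), simhash(list(normalize(text2))))
print(before, after)
```

Once the typo is corrected, both inputs are identical, so the distance after normalization is 0 regardless of how the features are hashed.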