Unable to see near duplicates
sushilr007 opened this issue
Sheel commented
My code:
from simhash import Simhash, SimhashIndex
data = {
    1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: 'How are you i am fine. blar blar blar blar blar than',
    3: 'This is simhash test.',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print(index.bucket_size())
# s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
print(index.get_near_dups(s1))
index.add('4', s1)
print(index.get_near_dups(s1))
I also tried:
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
Actual output:
7
[]
['4']
Expected Output:
7
['1']
['4','1']
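One way to narrow this down is to check the raw Hamming distances by hand. The sketch below is not the library's code: `hamming_distance` and `brute_force_near_dups` are hypothetical helper names, and the fingerprint values are made up for illustration. It assumes, as I understand the library, that `get_near_dups` only returns ids whose fingerprint differs from the query in at most `k` bits, and that `k` defaults to a small value when it is not passed to `SimhashIndex`.

```python
def hamming_distance(a: int, b: int, f: int = 64) -> int:
    """Count the bits that differ between two f-bit fingerprints."""
    mask = (1 << f) - 1
    return bin((a ^ b) & mask).count("1")


def brute_force_near_dups(query: int, fingerprints: dict, k: int = 2) -> list:
    """Return the ids whose fingerprint is within k bits of query, sorted."""
    return sorted(
        obj_id
        for obj_id, value in fingerprints.items()
        if hamming_distance(query, value) <= k
    )


# Hypothetical fingerprints, made up for illustration only.
# They are NOT the values simhash would produce for the strings above.
fingerprints = {
    "1": 0b1011_0000,
    "2": 0b1011_0001,
    "3": 0xFFFF_0000_FFFF_0000,
}
query = 0b1011_0011

# Print the distance from the query to every stored fingerprint.
for obj_id, value in fingerprints.items():
    print(obj_id, hamming_distance(query, value))

# Only ids within k bits are reported as near duplicates.
print(brute_force_near_dups(query, fingerprints, k=2))
```

To apply this to the code in the report, the made-up integers would be replaced by the real fingerprints (I believe each `Simhash` object exposes its fingerprint as `.value`, and also has a `distance()` method that does the same bit count). If the distance from `s1` to item 1 turns out to be larger than `k`, the empty list is what the index is expected to return. It may also be worth checking that `s1` and the indexed items are built from the same kind of input, since here the indexed items come from raw strings while `s1` comes from a `.split()` token list, which likely produces different features.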