1e0ng / simhash

A Python Implementation of Simhash Algorithm

Unable to find near duplicates

sushilr007 opened this issue

My code:

    from simhash import Simhash, SimhashIndex

    data = {
        1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
        2: 'How are you i am fine. blar blar blar blar blar than',
        3: 'This is simhash test.',
    }
    objs = [(str(k), Simhash(v)) for k, v in data.items()]
    index = SimhashIndex(objs)
    print(index.bucket_size())

    # s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
    s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
    print(index.get_near_dups(s1))

    index.add('4', s1)
    print(index.get_near_dups(s1))

I also tried:

    s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())

Output:

7
[]
['4']

Expected Output:

7
['1']
['4','1']
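
One thing that may be worth checking is how the query fingerprint is built. The indexed entries are created from plain strings, while `s1` is created from a token list (`.split()`). As far as I understand the library, a string is broken into overlapping character n-grams after lowercasing and stripping non-word characters, whereas a list is used as the feature set directly, so the two fingerprints may end up far apart. The index also only returns entries within a small Hamming distance (its `k` tolerance), so even the string variant may need a larger `k`. The sketch below is not the library's code: `features_from_text`, `toy_simhash`, and `hamming` are hypothetical helpers, and the 4-character window and MD5 hash are assumptions. It only illustrates how the two feature paths can be compared by Hamming distance.

```python
import hashlib
import re


def features_from_text(text, width=4):
    """Overlapping character n-grams over the lowercased word characters.

    Assumption: this mimics how string input is featurized; it is a
    simplified stand-in, not the library's implementation.
    """
    content = "".join(re.findall(r"\w+", text.lower()))
    return [content[i:i + width] for i in range(max(len(content) - width + 1, 1))]


def toy_simhash(features, bits=64):
    """Toy simhash: every feature votes +1 or -1 on each bit position."""
    votes = [0] * bits
    for feature in features:
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


indexed = "How are you? I Am fine. blar blar blar blar blar Thanks."
query = "How are you i am fine. blar blar blar blar blar thank"

# Both fingerprints built from strings (same feature path).
same_path = hamming(toy_simhash(features_from_text(indexed)),
                    toy_simhash(features_from_text(query)))

# Indexed entry from a string, query from a token list (mixed feature paths).
mixed_path = hamming(toy_simhash(features_from_text(indexed)),
                     toy_simhash(query.split()))

print("both from strings:    ", same_path)
print("string vs. token list:", mixed_path)
```

If the mixed-path distance comes out much larger than the index tolerance, building `s1` from the plain string (the commented-out line in the snippet above) and, if needed, passing a larger `k` when constructing `SimhashIndex` would be the first things to try.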