1e0ng / simhash

A Python Implementation of Simhash Algorithm

It seems SimhashIndex is broken

debunge opened this issue · comments

The following example

from simhash import Simhash, SimhashIndex
data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
    4: u'How are you i am fine. blar blar blar blar blar thank1',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)

print map(lambda x: x[1].value, objs)
print index.bucket_size()

s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)

index.add(5, s1)
print index.get_near_dups(s1)

produces an unexpected result:

[8440240427182201978, 8440240356449459322, 9984379969213434071L, 17663612459742043242L]
10
[]
[u'5']

Also, passing k=3 or another value into SimhashIndex produces
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 112, in get_near_dups
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 56, in __init__
Exception: Bad parameter with type <type 'in

commented

@debunge I tried your code. It seems that the distance between "How are you i am fine. blar blar blar blar blar thank" and "How are you i am fine. blar blar blar blar blar than" is 5. It shouldn't be that large. One solution is to pass a list of features to Simhash instead of the raw string. As for the second question, k=3 should be passed to the SimhashIndex constructor, not to get_near_dups.
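To make the "distance" mentioned above concrete, here is a minimal textbook sketch of a simhash fingerprint and the Hamming distance between two fingerprints. It is written for Python 3 and is not the library's code: the helper names (`features`, `fingerprint`, `hamming`), the 4-character sliding window, and the use of MD5 are all assumptions made for illustration.

```python
import hashlib
import re


def features(text, width=4):
    """Assumed tokenizer: sliding character n-grams over the lowercased
    text with all non-word characters removed."""
    text = re.sub(r"\W+", "", text.lower())
    return [text[i:i + width] for i in range(max(len(text) - width + 1, 1))]


def fingerprint(feats, f=64):
    """Textbook simhash: each feature's hash votes +1/-1 on every bit,
    and the fingerprint keeps the bits with a positive total."""
    votes = [0] * f
    for feat in feats:
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)
        for i in range(f):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(f) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = fingerprint(features("How are you i am fine. blar blar blar blar blar thank"))
b = fingerprint(features("How are you i am fine. blar blar blar blar blar than"))
k = 3  # tolerance in bits; two items count as near-duplicates if distance <= k
print(hamming(a, b), hamming(a, b) <= k)
```

With a scheme like this, a near-duplicate lookup can only return an item whose distance to the query is at most k, so a distance of 5 would be missed by any smaller tolerance.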

Thank you for the reply!

Could you tell me how to pass k=3 to SimhashIndex correctly?
I'm trying the following:

from simhash import Simhash, SimhashIndex
data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
    4: u'How are you i am fine. blar blar blar blar blar thank1',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs, k=10)

print map(lambda x: x[1].value, objs)
print index.bucket_size()

s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)

index.add(5, s1)
print index.get_near_dups(s1)

And it fails with:
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    print index.get_near_dups(s1)
  File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 112, in get_near_dups
  File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 56, in __init__
Exception: Bad parameter with type <type 'int'>

commented

I have a similar issue:

  File "/usr/local/lib/python2.7/dist-packages/simhash/__init__.py", line 112, in get_near_dups
    sim2 = Simhash(int(sim2, 16), self.f)
  File "/usr/local/lib/python2.7/dist-packages/simhash/__init__.py", line 56, in __init__
    raise Exception('Bad parameter with type {}'.format(type(value)))
Exception: Bad parameter with type <type 'int'>

When I look at line 112, it always sends an integer (int(sim2, 16)) as the value used to initialize a Simhash object. But int appears not to be among the types that Simhash's __init__ checks for. What type should line 112 send to Simhash?
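The question above is about type dispatch in a constructor. The sketch below shows one way such a constructor could accept an already-computed integer fingerprint alongside strings and feature lists. `Fingerprint` is a hypothetical stand-in written for Python 3, not the library's `Simhash` class, and its hashing and tokenization are assumptions.

```python
import hashlib
import numbers
from collections.abc import Iterable


class Fingerprint:
    """Hypothetical simhash-like class, for illustration only."""

    def __init__(self, value, f=64):
        self.f = f
        if isinstance(value, Fingerprint):
            self.value = value.value
        elif isinstance(value, numbers.Integral):
            # An already-computed fingerprint, e.g. int(hex_string, 16).
            self.value = int(value)
        elif isinstance(value, str):
            # Assumed tokenization: whitespace-separated words.
            self.value = self._build(value.split())
        elif isinstance(value, Iterable):
            # A caller-supplied collection of feature strings.
            self.value = self._build(value)
        else:
            raise TypeError("Bad parameter with type {}".format(type(value)))

    def _build(self, feats):
        # Per-bit majority vote over the hashed features.
        votes = [0] * self.f
        for feat in feats:
            h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)
            for i in range(self.f):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(self.f) if votes[i] > 0)


original = Fingerprint(["how", "are", "you"])
stored = "%x" % original.value            # hex string, as an index might store it
restored = Fingerprint(int(stored, 16))   # round-trip through an integer
print(restored.value == original.value)
```

Checking against `numbers.Integral` rather than one concrete integer type is a design choice that keeps the round-trip working regardless of how large the parsed value is.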

commented

@debunge @bef55 Thank you for the feedback. This issue was caused by a recent merge. I have fixed it and uploaded the fix to PyPI as well.

Yes, the k problem is indeed fixed, but another problem persists:

from simhash import Simhash, SimhashIndex
data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)

print index.bucket_size()

s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)

index.add('4', s1)
print index.get_near_dups(s1)

This example, taken from your blog, produces

7
[]
[u'4']

but I think the second print should list items 1 and 2.

commented

Hi @debunge, you may try this approach:

s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())

i.e., pass a list of feature strings.

@LiangSun Thank you very much! It works now!

commented

@debunge You are welcome. I should let you know that the above method doesn't take word order into account. The question of how to choose the feature list is beyond the scope of this project, so you may need to try different methods until you find one that works well for your data.
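As a rough illustration of the word-order point, the sketch below contrasts plain word features with word n-grams (shingles), which retain some local ordering. Both helper names are hypothetical and appear nowhere in the library; either list could then be passed to Simhash in the same way as the split() result above.

```python
def word_features(text):
    """Bag of lowercased words: order carries no information."""
    return text.lower().split()


def word_shingles(text, n=2):
    """Overlapping runs of n consecutive words: local order is kept."""
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


original = "how are you"
shuffled = "you are how"

# The same words in a different order yield the same bag of words...
print(sorted(word_features(original)) == sorted(word_features(shuffled)))
# ...but different shingles.
print(word_shingles(original), word_shingles(shuffled))
```

Larger values of n preserve more ordering but make the features more sensitive to small edits, so the choice is a trade-off that depends on the data.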

commented

@1e0ng
The line below is not working:
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())

My code

from simhash import Simhash, SimhashIndex

data = {
    1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: 'How are you i am fine. blar blar blar blar blar than',
    3: 'This is simhash test.',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print(index.bucket_size())

# s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
print(index.get_near_dups(s1))

index.add('4', s1)
print(index.get_near_dups(s1))

OUTPUT:

7
[]
['4']
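One possible explanation, offered as an assumption rather than a confirmed diagnosis: the indexed items above are built from raw strings while the query is built from a word list, so the two sides may be hashed from different feature sets and their fingerprints would not be comparable. The self-contained Python 3 sketch below uses the same feature extraction on both sides. Its `fingerprint`, `hamming`, and brute-force `near_dups` helpers are illustrative stand-ins, not the library's implementation.

```python
import hashlib


def fingerprint(feats, f=64):
    """Textbook simhash over a list of feature strings (MD5 is an assumption)."""
    votes = [0] * f
    for feat in feats:
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)
        for i in range(f):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(f) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def near_dups(query, indexed, k):
    """Brute-force lookup: keys whose fingerprint is within k bits of the query."""
    return sorted(key for key, fp in indexed.items() if hamming(query, fp) <= k)


def tokenize(text):
    """One feature extractor, shared by the indexed items and the query."""
    return text.lower().split()


data = {
    1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: 'How are you i am fine. blar blar blar blar blar than',
    3: 'This is simhash test.',
}
indexed = {str(key): fingerprint(tokenize(text)) for key, text in data.items()}
query = fingerprint(tokenize('How are you i am fine. blar blar blar blar blar thank'))

for key in sorted(indexed):
    print(key, hamming(query, indexed[key]))
print(near_dups(query, indexed, k=3))
```

Printing the raw distances first makes it easier to see whether an empty result comes from the tolerance k or from the features themselves.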

commented

@sushilr007 Please create a new issue and describe your expected output.