It seems SimhashIndex is broken
debunge opened this issue
The following example
from simhash import Simhash, SimhashIndex
data = {
1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
2: u'How are you i am fine. blar blar blar blar blar than',
3: u'This is simhash test.',
4: u'How are you i am fine. blar blar blar blar blar thank1',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print map(lambda x: x[1].value, objs)
print index.bucket_size()
s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)
index.add(5, s1)
print index.get_near_dups(s1)
produces an unexpected result:
[8440240427182201978, 8440240356449459322, 9984379969213434071L, 17663612459742043242L]
10
[]
[u'5']
Also, passing k=3 or another value into SimhashIndex produces:
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 112, in get_near_dups
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 56, in __init__
Exception: Bad parameter with type <type 'in
@debunge I tried your code. It seems that the distance between "How are you i am fine. blar blar blar blar blar thank" and "How are you i am fine. blar blar blar blar blar than" is 5. It shouldn't be that big. One solution is to pass a feature array instead of the string to Simhash. For the second question, k=3 should be passed to the SimhashIndex constructor, not to get_near_dups.
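To make the distance argument concrete, here is a minimal sketch of the Hamming distance between two 64-bit fingerprints, using the first two values printed in the report above. The helper names `hamming_distance` and `is_near_dup` are made up for illustration and are not part of the simhash package, and the sketch assumes that the tolerance `k` means the maximum number of differing bits.

```python
def hamming_distance(a, b, f=64):
    """Count the bits that differ between two f-bit fingerprints."""
    mask = (1 << f) - 1
    return bin((a ^ b) & mask).count("1")


def is_near_dup(a, b, k):
    """Hypothetical check: near-duplicates differ in at most k bits."""
    return hamming_distance(a, b) <= k


# Fingerprints of items 1 and 2, copied from the printed output in the report.
fp1 = 8440240427182201978
fp2 = 8440240356449459322

print(hamming_distance(fp1, fp2))   # 4
print(is_near_dup(fp1, fp2, k=3))   # False
print(is_near_dup(fp1, fp2, k=4))   # True
```

Under this reading, two texts whose fingerprints differ in 5 bits would only be reported together when the index is built with a tolerance of at least 5.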
Thank you for the reply!
Could you please tell me how to pass k=3 to SimhashIndex correctly?
I'm trying the following:
from simhash import Simhash, SimhashIndex
data = {
1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
2: u'How are you i am fine. blar blar blar blar blar than',
3: u'This is simhash test.',
4: u'How are you i am fine. blar blar blar blar blar thank1',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs, k=10)
print map(lambda x: x[1].value, objs)
print index.bucket_size()
s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)
index.add(5, s1)
print index.get_near_dups(s1)
And it fails with:
Traceback (most recent call last):
File "test.py", line 15, in <module>
print index.get_near_dups(s1)
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 112, in get_near_dups
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 56, in __init__
Exception: Bad parameter with type <type 'int'>
I have a similar issue:
File "/usr/local/lib/python2.7/dist-packages/simhash/__init__.py", line 112, in get_near_dups
sim2 = Simhash(int(sim2, 16), self.f)
File "/usr/local/lib/python2.7/dist-packages/simhash/__init__.py", line 56, in __init__
raise Exception('Bad parameter with type {}'.format(type(value)))
Exception: Bad parameter with type <type 'int'>
When I look at line 112, it always passes an integer (int(sim2, 16)) as the value used to construct a Simhash object. But int does not appear to be among the types that the Simhash constructor accepts. What type should line 112 pass to Simhash?
Yep, the k problem is indeed fixed, but another problem persists:
from simhash import Simhash, SimhashIndex
data = {
1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
2: u'How are you i am fine. blar blar blar blar blar than',
3: u'This is simhash test.',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print index.bucket_size()
s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)
index.add('4', s1)
print index.get_near_dups(s1)
This code, taken from the example in your blog, produces:
7
[]
[u'4']
but I think the second print should list items 1 and 2.
Hi @debunge, you may try this method:
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
i.e., use an array of feature strings.
@debunge You are welcome. I should let you know that the above method doesn't take the order of the words into account. The question of how to choose the feature array is beyond the scope of this project, so you may need to try other methods until you find one that works well for your data.
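As one possible way to keep some word order in the features, here is a small sketch that builds overlapping word pairs (shingles) instead of single words. This is an assumption about what a useful feature array could look like, not a method recommended by the project; the helpers `word_features` and `word_shingles` are hypothetical names. The resulting list of strings could be passed to Simhash in the same way as the `.split()` list above.

```python
import re


def word_features(text):
    """Lowercased word tokens; their order is lost once each is hashed."""
    return re.findall(r"\w+", text.lower())


def word_shingles(text, width=2):
    """Overlapping runs of `width` consecutive words, joined by a space."""
    words = word_features(text)
    if len(words) < width:
        # Too short to form a full shingle: fall back to one feature.
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + width])
            for i in range(len(words) - width + 1)]


text = "How are you i am fine"
print(word_features(text))
# ['how', 'are', 'you', 'i', 'am', 'fine']
print(word_shingles(text))
# ['how are', 'are you', 'you i', 'i am', 'am fine']
```

Whether shingles give better near-duplicate detection than single words depends on the data, so treat the width as a parameter to experiment with.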
@1e0ng
The following is not working:
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
My code:
from simhash import Simhash, SimhashIndex
data = {
1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
2: 'How are you i am fine. blar blar blar blar blar than',
3: 'This is simhash test.',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print(index.bucket_size())
# s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
print(index.get_near_dups(s1))
index.add('4', s1)
print(index.get_near_dups(s1))
OUTPUT:
7
[]
['4']
@sushilr007 Please create a new issue and describe your expected output.