1e0ng / simhash

A Python Implementation of Simhash Algorithm

Question regarding bucket implementation

orapic opened this issue

Hi,

I've read your implementation of near-duplicate detection, but one thing isn't clear to me and I hoped you could clarify it. You use a dictionary whose keys are a concatenation of a simhash chunk and its index, and whose values are a concatenation of the full simhash and the simhash id.

keys: yield '%x:%x' % (c, i)

values:
    for key in self.get_keys(simhash):
        v = '%x,%s' % (simhash.value, obj_id)
        self.bucket[key].add(v)

Dictionaries don't allow repeated keys; assigning to an existing key overwrites its value. So if two simhashes have the same chunk value at the same permutation index (for whatever reason), aren't we losing one of them when building the SimhashIndex object, and missing a possible duplicate?
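The situation described in the question can be reproduced with a small sketch. It assumes a 64-bit fingerprint split into k + 1 contiguous blocks and reuses the '%x:%x' key format quoted above; the helper name `chunk_keys` and its parameters `f` and `k` are hypothetical and are not taken from the library.

```python
def chunk_keys(value, f=64, k=2):
    """Yield one 'chunk:index' key per block of an f-bit fingerprint.

    Hypothetical helper for illustration; not the library's own code.
    """
    # Start offsets of k + 1 contiguous blocks; the last block takes the remainder.
    offsets = [f // (k + 1) * i for i in range(k + 1)]
    for i, offset in enumerate(offsets):
        end = offsets[i + 1] if i + 1 < len(offsets) else f
        mask = (1 << (end - offset)) - 1
        chunk = (value >> offset) & mask
        yield '%x:%x' % (chunk, i)


a = 0x0123456789abcdef
b = a ^ (1 << 63)  # same fingerprint with only the top bit flipped

# Keys produced by both fingerprints: these are the "repeated keys" in question.
shared = set(chunk_keys(a)) & set(chunk_keys(b))
print(sorted(shared))
```

Because the two fingerprints differ only in the highest block, they produce identical keys for the two lower blocks, which is exactly the case the question asks about.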

commented

Hi orapic,

I see you are trying to understand the code.
If I understand you correctly, your question is about the dictionary self.bucket. If so, note that the value of each dictionary item is a set, so self.bucket[key].add(v) adds v to the set stored under that key instead of replacing it.
Hope this helps.
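A minimal sketch of that behaviour, assuming the bucket is a dictionary whose values are sets (`collections.defaultdict(set)` is one way to get that). The `add` helper, the document ids, and the hand-written keys are made up for illustration.

```python
from collections import defaultdict

# Each key maps to a set, so several fingerprints can live under one key.
bucket = defaultdict(set)


def add(obj_id, value, keys):
    """Store 'value,obj_id' under every key (hypothetical helper)."""
    for key in keys:
        bucket[key].add('%x,%s' % (value, obj_id))


# Two fingerprints that share the key 'ab:1' but differ on the other key.
add('doc1', 0xabc0, ['c0:0', 'ab:1'])
add('doc2', 0xabc1, ['c1:0', 'ab:1'])

print(sorted(bucket['ab:1']))  # both entries are kept under the shared key
```

With a plain assignment such as `bucket[key] = v`, the second call would overwrite the first; `.add(v)` on a set keeps both.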

Thanks.

Hi again,

Yes, indeed I was trying to understand the code.
OK, I understand how the bucket works now. Thanks a lot!