Question regarding bucket implementation

orapic opened this issue 3 years ago

Hi,

I've read your implementation of near-duplicate detection, but one thing isn't clear to me and I hoped you could clarify it. You use a dictionary whose keys are a concatenation of a simhash chunk and its index, and whose values are a concatenation of the full simhash and the simhash id.

keys: yield '%x:%x' % (c, i)

values:

    for key in self.get_keys(simhash):
        v = '%x,%s' % (simhash.value, obj_id)
        self.bucket[key].add(v)

Dictionaries don't allow repeated keys: assigning to an existing key replaces its value. So if two simhashes have the same value for the same chunk at the same permutation index (for whatever reason), aren't we losing one of them when building the SimhashIndex object, and therefore missing a possible duplicate?

1e0ng · Answer 1 · Thu Feb 11 2021 10:30:40 GMT+0800 (China Standard Time)

Hi orapic,

I see you are trying to understand the code.
If I understand you correctly, the dictionary you are asking about is self.bucket. If so, note that the value of each dictionary item is a set, so self.bucket[key].add(v) adds v to the set stored under that key rather than replacing it. Two simhashes that produce the same key therefore both end up in the same set, and neither is lost.
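To make that concrete, here is a minimal standalone sketch, not the library's actual code. It assumes the bucket behaves like a `collections.defaultdict(set)`; the `add` helper, the keys, and the document ids are made up for illustration. Only the value format is taken from the snippet quoted in the question.

```python
from collections import defaultdict

# Hypothetical stand-in for self.bucket: every key maps to a set, so adding
# under an existing key grows the set instead of overwriting the old entry.
bucket = defaultdict(set)


def add(key, simhash_value, obj_id):
    # Same value format as quoted in the question: hex simhash, comma, object id.
    bucket[key].add('%x,%s' % (simhash_value, obj_id))


# Two different simhashes that happen to share the same chunk (0xab) at index 0,
# and therefore the same (made-up) key 'ab:0'.
add('ab:0', 0xab12, 'doc1')
add('ab:0', 0xab34, 'doc2')

print(sorted(bucket['ab:0']))  # ['ab12,doc1', 'ab34,doc2']
```

Both entries survive under the one key, so a later lookup on that key sees both candidates and can compare their full simhash values.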
Hope this helps.

Thanks.

orapic · Answer 2 · Sun Feb 14 2021 06:05:25 GMT+0800 (China Standard Time)

Hi again,

Yes, indeed I was trying to understand the code.
Ok, I understand now how the bucket works, thanks a lot!