1e0ng / simhash

A Python Implementation of Simhash Algorithm

Dimension of simhash fingerprint not always equal to 64

Gladiator566 opened this issue · comments

Hi, I appreciate the great work!
There is something I'm confused about.

I'm dealing with Chinese HTML text, so I customized the get_features function. Specifically, I first extract the text content from the .html file, then use jieba.analyse.extract_tags to get the topK keywords and their TF-IDF weights.

Then I call sh = Simhash(get_features(content)).value to get the simhash fingerprint, and len(bin(sh)[2:]) to check its dimension. The [2:] is there because bin() always prefixes its output with '0b'.
Here is the problem: the length of the fingerprint is not always 64, even though the default self.f is set to 64. So far I have seen 61, 62, 63, and 64.
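
The varying length can be reproduced without the library at all. The sketch below uses made-up 64-bit integers as stand-ins for Simhash(...).value (they are not real fingerprints) and applies the same len(bin(...)[2:]) check:

```python
# Hypothetical 64-bit values standing in for Simhash(...).value
samples = [
    0x8000000000000001,  # highest bit set
    0x1000000000000001,  # top three bits are 0
    0x0000000000000001,  # only the lowest bit set
]

for value in samples:
    digits = bin(value)[2:]  # bin() never emits leading zeros
    # int.bit_length() reports the same count without building a string
    print(len(digits), value.bit_length())
# prints:
# 64 64
# 61 61
# 1 1
```

The measured length is the position of the highest set bit, not the width f the fingerprint was computed with.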

Have you ever encountered this problem before? Am I using the right method to check the dimension of the fingerprint? How can the dimension vary instead of always being 64? @1e0ng

Thanks !

commented

Hi, this is because some values have leading 0s, and bin() does not print them.
Try this and you will see:

'{0:064b}'.format(sh)
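
For reuse, the same format spec can be wrapped in a small helper. This is only a sketch: to_bits is a hypothetical name, not part of the simhash package, and the value of sh below is made up for illustration.

```python
def to_bits(value, width=64):
    """Return value as a binary string, left-padded with zeros to `width` digits."""
    return format(value, '0{}b'.format(width))


sh = 0x1000000000000001      # made-up fingerprint whose top three bits are 0
bits = to_bits(sh)

print(len(bin(sh)[2:]))      # 61, leading zeros dropped
print(len(bits))             # 64, leading zeros kept
print(int(bits, 2) == sh)    # True, padding does not change the value
```

bin(sh)[2:].zfill(64) produces the same string. If the fingerprint was built with a different f, pass that number as width.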

Thanks a lot!