Dimension of simhash fingerprint not always equal to 64
Gladiator566 opened this issue · comments
Hi, I appreciate the great work!
There is something I'm confused about.
I'm dealing with Chinese HTML text, so I customized the get_features function. Specifically, I first extract the text content from the .html file, then use jieba.analyse.extract_tags to get the topK keywords and their TF-IDF weights.
Then I call sh = Simhash(get_features(content)).value to get the simhash fingerprint, and len(bin(sh)[2:]) to check its dimension. The [2:] slice is there because bin() always puts '0b' in front of the binary digits.
Here is my question: I found that the dimension of the fingerprint is not always 64, even though the default self.f is set to 64. As far as I have seen, it can be 61, 62, 63, or 64.
Have you ever encountered this problem before? Am I using the right method to check the dimension of the fingerprint? How can the dimension vary instead of always being 64? @1e0ng
Thanks!
Hi, this is because some values have leading 0s, and bin() does not print leading zeros.
Try this and you will see:
'{0:064b}'.format(sh)
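To make the difference concrete, here is a minimal sketch that uses a hand-picked integer in place of a real Simhash value, so it runs without the simhash package. The helper name to_bits and the sample value are assumptions made for illustration only; they are not part of the library.

```python
def to_bits(value, f=64):
    """Return value as a binary string zero-padded to f digits.

    Hypothetical helper for illustration; f mirrors the fingerprint
    width (64 by default in the question above).
    """
    return '{0:0{1}b}'.format(value, f)


# Stand-in for Simhash(...).value: a 64-bit value whose top three bits are 0.
sh = 0x1FFFFFFFFFFFFFFF

# bin() drops leading zeros, so the string is shorter than 64 characters.
print(len(bin(sh)[2:]))   # 61

# Zero-padded formatting always yields the full width.
print(len(to_bits(sh)))   # 64

# The padded string still encodes the same integer.
print(int(to_bits(sh), 2) == sh)  # True
```

The fingerprint itself is still a 64-bit value in both cases. Only the string representation changes, so comparing fingerprints as integers is unaffected.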
Thanks a lot!