Fast simhash calculation using numpy

Question

jcushman opened this issue 4 years ago · comments

I'm just opening this for documentation / FYI, since I previously sent a pull request on optimizations (#48) -- feel free to close. :)

It is possible to calculate simhashes about 10 times as fast with a dependency on numpy. Here's a simple example:

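import hashlib

import numpy as np
import simhash

def hashfunc(obj):
    # Last 8 bytes of the MD5 digest, i.e. a 64-bit hash of the feature.
    return hashlib.md5(obj.encode('utf8')).digest()[-8:]

def np_simhash(features):
    # Repeat each 8-byte hash w times so that every 64-bit row is one vote.
    bytestring = b''.join(hashfunc(f) * w for f, w in features)
    bitarray = np.unpackbits(np.frombuffer(bytestring, dtype=np.uint8))
    rows = np.reshape(bitarray, (-1, 64))
    # Per-bit count of 1s across all rows.
    sums = np.sum(rows, 0)
    # Set a bit when more than half of the rows have it set. rows.shape[0]
    # is the total weight, which differs from len(features) unless every
    # weight is 1.
    return int.from_bytes(np.packbits(sums > rows.shape[0] // 2).tobytes(), "big")

features = [("foo", 1), ("bar", 2)]

assert np_simhash(features) == simhash.Simhash(features).value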

In my testing this runs about 10 times as fast (on randomly generated documents of a few hundred tokens, and on real-life text documents split into 3-word shingles). With this implementation the underlying hash function becomes the bottleneck, rather than bit counting as is currently the case.
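For anyone who wants to check the timing claim on their own data, here is a rough benchmark sketch. It is not the library's code: py_simhash is a hypothetical pure-Python baseline that approximates a per-bit voting loop, np_simhash_unit is a unit-weight variant of the numpy approach, and the 300 random tokens and 50 repetitions are arbitrary choices. The third timing measures hashing alone, which gives a sense of how much of the numpy version's time goes to the hash function.

```python
import hashlib
import random
import timeit

import numpy as np

def hashfunc(obj):
    # Same 64-bit hash as in the example: last 8 bytes of the MD5 digest.
    return hashlib.md5(obj.encode('utf8')).digest()[-8:]

def py_simhash(features):
    # Hypothetical baseline: add or subtract the weight for each of 64 bits.
    v = [0] * 64
    for f, w in features:
        h = int.from_bytes(hashfunc(f), "big")
        for i in range(64):
            v[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(64) if v[i] > 0)

def np_simhash_unit(tokens):
    # Vectorized majority vote, assuming every token has weight 1.
    bytestring = b''.join(hashfunc(t) for t in tokens)
    bits = np.unpackbits(np.frombuffer(bytestring, dtype=np.uint8)).reshape(-1, 64)
    majority = 2 * bits.sum(axis=0) > len(tokens)
    return int.from_bytes(np.packbits(majority).tobytes(), "big")

random.seed(0)
tokens = [str(random.random()) for _ in range(300)]
features = [(t, 1) for t in tokens]

# Both versions should agree before their timings are compared.
assert py_simhash(features) == np_simhash_unit(tokens)

n = 50
print("pure Python:", timeit.timeit(lambda: py_simhash(features), number=n))
print("numpy:      ", timeit.timeit(lambda: np_simhash_unit(tokens), number=n))
print("hash only:  ", timeit.timeit(lambda: [hashfunc(t) for t in tokens], number=n))
```

The actual ratio will depend on document length, the hash function, and the machine, so the numbers printed here are only indicative.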

This might not be a good fit for this library because of the numpy dependency. My simple example would also need a little tweaking to keep the current library's API, it might want to do the summing in batches to avoid using too much RAM, and it would need some sort of fallback for large weights, where repeating each hash with * w is a bad idea. But I wanted to share in case the speedup is of interest.
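One possible way to address the last two points, sketched under assumptions rather than tested against the library: instead of repeating each hash w times, multiply each row of bits by its weight, and accumulate the per-bit sums one batch at a time. The name np_simhash_weighted and the batch_size default of 4096 are made up for illustration, and weights are assumed to be non-negative integers.

```python
import hashlib

import numpy as np

def hashfunc(obj):
    # Same 64-bit hash as in the example: last 8 bytes of the MD5 digest.
    return hashlib.md5(obj.encode('utf8')).digest()[-8:]

def np_simhash_weighted(features, batch_size=4096):
    """Sketch: weighted bit sums without repeating hashes, in batches."""
    features = list(features)
    sums = np.zeros(64, dtype=np.int64)  # weighted count of 1s per bit
    total_weight = 0
    for start in range(0, len(features), batch_size):
        batch = features[start:start + batch_size]
        bytestring = b''.join(hashfunc(f) for f, _ in batch)
        weights = np.array([w for _, w in batch], dtype=np.int64)
        bits = np.unpackbits(np.frombuffer(bytestring, dtype=np.uint8))
        # (n,) weights times (n, 64) bits gives 64 weighted column sums.
        sums += weights @ bits.reshape(-1, 64)
        total_weight += int(weights.sum())
    # A bit is set when its weighted 1-votes exceed half the total weight.
    majority = 2 * sums > total_weight
    return int.from_bytes(np.packbits(majority).tobytes(), "big")

print(hex(np_simhash_weighted([("foo", 1), ("bar", 2_000_000)])))
```

Memory use is bounded by batch_size rows of 64 bits regardless of how large the weights are, at the cost of one small matrix product per batch.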

1e0ng · Answer 1 · Tue Oct 06 2020 17:56:55 GMT+0800 (China Standard Time)

Hi @jcushman, performance is always of interest. A dependency is not a problem as long as it's platform-independent and easy to install on the most widely used operating systems. Why don't you create a pull request for that?