Fast simhash calculation using numpy
jcushman opened this issue
I'm just opening this for documentation / FYI, since I previously sent a pull request on optimizations (#48) -- feel free to close. :)
It is possible to calculate simhashes about 10 times as fast with a dependency on numpy. Here's a simple example:
import numpy as np
import hashlib
import simhash

def hashfunc(obj):
    # 64-bit hash: the last 8 bytes of the md5 digest (matches the low 64 bits
    # of the library's default md5-based hashfunc)
    return hashlib.md5(obj.encode('utf8')).digest()[-8:]

def np_simhash(features):
    # repeat each feature's 8-byte hash w times, so weights become row counts
    bytestring = b''.join(hashfunc(f) * w for f, w in features)
    bitarray = np.unpackbits(np.frombuffer(bytestring, dtype='>B'))
    rows = np.reshape(bitarray, (-1, 64))
    # column sums = weighted count of 1-bits at each bit position
    sums = np.sum(rows, 0)
    # a bit is set when more than half the total weight has it set; the threshold
    # uses the number of rows (the sum of the weights), not len(features), so
    # weighted features come out the same as in the library
    return int.from_bytes(np.packbits(sums > rows.shape[0] // 2).tobytes(), "big")

features = [("foo", 1), ("bar", 2)]
assert np_simhash(features) == simhash.Simhash(features).value
This runs about 10 times as fast in my testing (using randomly generated documents of a few hundred tokens, and also real-life text documents split into 3-word shingles). With this implementation the underlying hash function becomes the bottleneck, rather than the bit counting, which is the bottleneck in the current implementation.
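For anyone who wants to reproduce the comparison, something along these lines is enough to see the difference (this reuses np_simhash and the simhash import from above; the random-token setup and the timeit parameters are just illustrative placeholders, not my exact benchmark):

import random
import string
import timeit

# 100 random "documents" of 300 six-letter tokens, each with weight 1
docs = [
    [("".join(random.choices(string.ascii_lowercase, k=6)), 1) for _ in range(300)]
    for _ in range(100)
]

t_np = timeit.timeit(lambda: [np_simhash(d) for d in docs], number=10)
t_lib = timeit.timeit(lambda: [simhash.Simhash(d).value for d in docs], number=10)
print("numpy: %.3fs, library: %.3fs, speedup: %.1fx" % (t_np, t_lib, t_lib / t_np))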
This might not be a good fit for this library because of the dependency on numpy. My simple example would also need a little tweaking to match the API of the current library; it would probably want to do the summing in batches to avoid using too much RAM; and it would need some sort of fallback for large weights, where repeating the hash bytes with * w is a bad idea. But I wanted to share in case the speedup is of interest.
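To make the last two points concrete, here's a rough, untested sketch of what I mean (it reuses hashfunc from above, assumes features is a list of (token, weight) pairs, and the function name and chunk_size are just placeholders): multiply each feature's unpacked bit row by its weight instead of repeating bytes, and accumulate the column sums a chunk at a time.

import numpy as np

def np_simhash_batched(features, chunk_size=10000):
    sums = np.zeros(64, dtype=np.int64)
    total_weight = 0
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        weights = np.array([w for _, w in chunk], dtype=np.int64)
        bytestring = b''.join(hashfunc(f) for f, _ in chunk)
        rows = np.unpackbits(np.frombuffer(bytestring, dtype='>B')).reshape(-1, 64)
        # weight each feature's bit row instead of duplicating its bytes,
        # so large weights don't blow up the buffer
        sums += (rows * weights[:, None]).sum(axis=0)
        total_weight += int(weights.sum())
    return int.from_bytes(np.packbits(sums > total_weight // 2).tobytes(), "big")

That keeps peak memory proportional to chunk_size rather than to the document size or the weights, at the cost of a Python-level loop over the chunks.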