Good hash function for 64bit keys?
capr opened this issue · comments
Hi!
I noticed that khash is very sensitive to the choice of hash function for >= 64bit keys (not so for 32bit keys). In particular kh_int64_hash_func()
is extremely slow with random input. Does anyone have a similar experience? Is it expected to have wildly varying results based on the choice of hash function?
Thanks!
You can use the following. It is a little slow, but hash table is usually much slower than hashing.
uint64_t hash_64(uint64_t key)
{
key = ~key + (key << 21);
key = key ^ key >> 24;
key = (key + (key << 3)) + (key << 8);
key = key ^ key >> 14;
key = (key + (key << 2)) + (key << 4);
key = key ^ key >> 28;
key = key + (key << 31);
return key;
}
Hi @attractivechaos, thanks for answering and for the hash function. I still have trouble figuring this out. For instance you say that
hash table is usually much slower than hashing
Ok, so here's my results with the above hash_64
function:
key size: 4, inserts: 1000000, unique keys: 25%, mil/sec: 49.091
key size: 4, lookups: 1000000, unique keys: 25%, mil/sec: 307.542
key size: 8, inserts: 1000000, unique keys: 25%, mil/sec: 19.614
key size: 8, lookups: 1000000, unique keys: 25%, mil/sec: 39.554
8x slower than with 32bit keys, which is fine given that twice as many 32bit keys fit into the CPU caches than 64bit keys, the 32bit hash function is a noop, etc.
But then if I switch to simpler 64bit multiplicative hash I get the following:
key size: 4, inserts: 1000000, unique keys: 25%, mil/sec: 46.969
key size: 4, lookups: 1000000, unique keys: 25%, mil/sec: 308.119
key size: 8, inserts: 1000000, unique keys: 25%, mil/sec: 20.004
key size: 8, lookups: 1000000, unique keys: 25%, mil/sec: 109.173
So 1.5x speed just by changing the hash function.
Now if I switch khint_t
to int64_t
and use the identity hash function (so no collisions), I get this:
key size: 4, inserts: 1000000, unique keys: 25%, mil/sec: 49.293
key size: 4, lookups: 1000000, unique keys: 25%, mil/sec: 322.876
key size: 8, inserts: 1000000, unique keys: 25%, mil/sec: 34.743
key size: 8, lookups: 1000000, unique keys: 25%, mil/sec: 184.550
This is probably as good as it gets.
I am still curious about other people's experience with using different hash functions because:
- the theory of why the results vary so wildly is above my head, although obviously less collisions is better than more collisions on any implementation.
- you don't mention or seem to have tested khash using different hash functions in your blog posts.
Now it's quite possible that the slowness is dude to a bug in my code, but I think it would still help to know if other users of khash have had a similar experience or it's just me.
Microbenchmarks are tricky. Showing numbers means little without the source code.
You're right. I wasn't showing the code because it's not in C, it's actually a port of khash in Terra but I'm not sure that it matters much, unless it's a translation bug (Terra has the same memory model as C it being a LLVM frontend). Anyway here it is, FWIW: https://github.com/luapower/khash.