Good hash function for 64bit keys?

Question

Good hash function for 64bit keys?

capr opened this issue 6 years ago · comments

Hi!

I noticed that khash is very sensitive to the choice of hash function for >= 64bit keys (not so for 32bit keys). In particular kh_int64_hash_func() is extremely slow with random input. Does anyone have a similar experience? Is it expected to have wildly varying results based on the choice of hash function?

Thanks!

Attractive Chaos · Answer 1 · Thu Jan 10 2019 13:02:54 GMT+0800 (China Standard Time)

You can use the following. It is a little slow, but hash table is usually much slower than hashing.

uint64_t hash_64(uint64_t key)
{
	key = ~key + (key << 21);
	key = key ^ key >> 24;
	key = (key + (key << 3)) + (key << 8);
	key = key ^ key >> 14;
	key = (key + (key << 2)) + (key << 4);
	key = key ^ key >> 28;
	key = key + (key << 31);
	return key;
}

Cosmin Apreutesei · Answer 2 · Thu Jan 10 2019 21:43:51 GMT+0800 (China Standard Time)

Hi @attractivechaos, thanks for answering and for the hash function. I still have trouble figuring this out. For instance you say that

hash table is usually much slower than hashing

Ok, so here's my results with the above hash_64 function:

key size:  4, inserts:  1000000, unique keys:  25%, mil/sec:   49.091
key size:  4, lookups:  1000000, unique keys:  25%, mil/sec:  307.542
key size:  8, inserts:  1000000, unique keys:  25%, mil/sec:   19.614
key size:  8, lookups:  1000000, unique keys:  25%, mil/sec:   39.554

8x slower than with 32bit keys, which is fine given that twice as many 32bit keys fit into the CPU caches than 64bit keys, the 32bit hash function is a noop, etc.

But then if I switch to simpler 64bit multiplicative hash I get the following:

key size:  4, inserts:  1000000, unique keys:  25%, mil/sec:   46.969
key size:  4, lookups:  1000000, unique keys:  25%, mil/sec:  308.119
key size:  8, inserts:  1000000, unique keys:  25%, mil/sec:   20.004
key size:  8, lookups:  1000000, unique keys:  25%, mil/sec:  109.173

So 1.5x speed just by changing the hash function.

Now if I switch khint_t to int64_t and use the identity hash function (so no collisions), I get this:

key size:  4, inserts:  1000000, unique keys:  25%, mil/sec:   49.293
key size:  4, lookups:  1000000, unique keys:  25%, mil/sec:  322.876
key size:  8, inserts:  1000000, unique keys:  25%, mil/sec:   34.743
key size:  8, lookups:  1000000, unique keys:  25%, mil/sec:  184.550

This is probably as good as it gets.

I am still curious about other people's experience with using different hash functions because:

the theory of why the results vary so wildly is above my head, although obviously less collisions is better than more collisions on any implementation.
you don't mention or seem to have tested khash using different hash functions in your blog posts.

Now it's quite possible that the slowness is dude to a bug in my code, but I think it would still help to know if other users of khash have had a similar experience or it's just me.

Attractive Chaos · Answer 3 · Thu Jan 10 2019 23:52:34 GMT+0800 (China Standard Time)

Microbenchmarks are tricky. Showing numbers means little without the source code.

Cosmin Apreutesei · Answer 4 · Thu Jan 10 2019 23:57:45 GMT+0800 (China Standard Time)

You're right. I wasn't showing the code because it's not in C, it's actually a port of khash in Terra but I'm not sure that it matters much, unless it's a translation bug (Terra has the same memory model as C it being a LLVM frontend). Anyway here it is, FWIW: https://github.com/luapower/khash.