attractivechaos / klib

A standalone and lightweight C library

Home Page:http://attractivechaos.github.io/klib/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Good hash function for 64bit keys?

capr opened this issue · comments

Hi!

I noticed that khash is very sensitive to the choice of hash function for >= 64bit keys (not so for 32bit keys). In particular kh_int64_hash_func() is extremely slow with random input. Does anyone have a similar experience? Is it expected to have wildly varying results based on the choice of hash function?

Thanks!

You can use the following. It is a little slow, but hash table is usually much slower than hashing.

uint64_t hash_64(uint64_t key)
{
	key = ~key + (key << 21);
	key = key ^ key >> 24;
	key = (key + (key << 3)) + (key << 8);
	key = key ^ key >> 14;
	key = (key + (key << 2)) + (key << 4);
	key = key ^ key >> 28;
	key = key + (key << 31);
	return key;
}

Hi @attractivechaos, thanks for answering and for the hash function. I still have trouble figuring this out. For instance you say that

hash table is usually much slower than hashing

Ok, so here's my results with the above hash_64 function:

key size:  4, inserts:  1000000, unique keys:  25%, mil/sec:   49.091
key size:  4, lookups:  1000000, unique keys:  25%, mil/sec:  307.542
key size:  8, inserts:  1000000, unique keys:  25%, mil/sec:   19.614
key size:  8, lookups:  1000000, unique keys:  25%, mil/sec:   39.554

8x slower than with 32bit keys, which is fine given that twice as many 32bit keys fit into the CPU caches than 64bit keys, the 32bit hash function is a noop, etc.

But then if I switch to simpler 64bit multiplicative hash I get the following:

key size:  4, inserts:  1000000, unique keys:  25%, mil/sec:   46.969
key size:  4, lookups:  1000000, unique keys:  25%, mil/sec:  308.119
key size:  8, inserts:  1000000, unique keys:  25%, mil/sec:   20.004
key size:  8, lookups:  1000000, unique keys:  25%, mil/sec:  109.173

So 1.5x speed just by changing the hash function.

Now if I switch khint_t to int64_t and use the identity hash function (so no collisions), I get this:

key size:  4, inserts:  1000000, unique keys:  25%, mil/sec:   49.293
key size:  4, lookups:  1000000, unique keys:  25%, mil/sec:  322.876
key size:  8, inserts:  1000000, unique keys:  25%, mil/sec:   34.743
key size:  8, lookups:  1000000, unique keys:  25%, mil/sec:  184.550

This is probably as good as it gets.

I am still curious about other people's experience with using different hash functions because:

  1. the theory of why the results vary so wildly is above my head, although obviously less collisions is better than more collisions on any implementation.
  2. you don't mention or seem to have tested khash using different hash functions in your blog posts.

Now it's quite possible that the slowness is dude to a bug in my code, but I think it would still help to know if other users of khash have had a similar experience or it's just me.

Microbenchmarks are tricky. Showing numbers means little without the source code.

You're right. I wasn't showing the code because it's not in C, it's actually a port of khash in Terra but I'm not sure that it matters much, unless it's a translation bug (Terra has the same memory model as C it being a LLVM frontend). Anyway here it is, FWIW: https://github.com/luapower/khash.