attractivechaos / klib

A standalone and lightweight C library

Home Page: http://attractivechaos.github.io/klib/


khashl.h - Massive increase in insertion time for large number of keys

7PintsOfCherryGarcia opened this issue

When counting kmers in a very large dataset (>2 billion unique kmers) I noted a significant slowdown as my hash table got larger and larger.

I measured insertion time into a hashset with the following code:

//gcc -O3 -o time_insert time_insert.c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#include "klib/khashl.h"
KHASHL_SET_INIT(static, kc_t, kc, uint64_t, kh_hash_uint64, kh_eq_generic)

int main()
{
    kc_t *h = kc_init();
    int absent;
    double tt = 0.0;
    clock_t t;
    for (uint64_t i = 0; i < 4294967295; i++) {
        t  = clock();
        kc_put(h, i, &absent);
        t = clock() - t;
        tt += ((double)t) / CLOCKS_PER_SEC;
        if ( !((i+1)%100000) ) {
            fprintf(stderr, "%u %f\n", kh_size(h), tt);
            tt = 0;
        }
    }
    kc_destroy(h);
}

Plotting the insertion time per 100k insertions I get:

[figure: insertion time per 100k insertions]

Is this an unavoidable consequence of khashl.h's design, meaning that for large hash tables I have to move to a multiple-table implementation?

Thanks for any help

If you have that many k-mers, use an ensemble of hash tables as is shown here.

An ensemble of hash tables also makes multi-threading easier and more effective. It is the better way for huge hash tables.

Hi,
Yeah, I understand why a hash table ensemble is a better solution for large set sizes. I am more curious about khashl's implementation and understanding how it works.
After more careful analysis and much effort, I managed to find the reason for the increase in insertion times as the hash table got larger and larger; more specifically, as the number of buckets passed $2^{32} - 1$.

It all boiled down to integer literals. In the bucket-count calculations, the 1U literal limits the shift to 32 bits of range for the number of buckets. Changing it to 1UL did the trick:

#define __KHASHL_IMPL_PUT(SCOPE, HType, prefix, khkey_t, __hash_fn, __hash_eq) \
	SCOPE khint_t prefix##_putp(HType *h, const khkey_t *key, int *absent) { \
		khint_t n_buckets, i, last, mask; \
		/* was: n_buckets = h->keys? 1U<<h->bits : 0U; */ \
		n_buckets = h->keys? 1UL<<h->bits : 0UL; \
		*absent = -1; \
		if (h->count >= (n_buckets>>1) + (n_buckets>>2)) { /* rehashing */ \
			if (prefix##_resize(h, n_buckets + 1U) < 0) \
				return n_buckets; \
			/* was: n_buckets = 1U<<h->bits; */ \
			n_buckets = 1UL<<h->bits; \
		} /* TODO: implement automatic shrinking; resize() already supports shrinking */ \
		...

The same change is applied in __KHASHL_IMPL_RESIZE where needed.

I had to make one other change which I don't know how or why it works. I had to change __kh_h2b(khint_t hash, khint_t bits) because it was giving values that were out of bounds when used as indexes into the hash keys. More specifically:

static kh_inline khint_t __kh_h2b(khint_t hash, khint_t bits)
{
    /*return hash * 2654435769U >> (32 - bits);*/
    return hash * 2654435769U >> (64 - bits);
}

I am unsure how __kh_h2b(khint_t hash, khint_t bits) works. I assume from context that it gives the index of a bucket in the hash table ("h2b" = "hash to bucket"?), since the result is used to access keys in the key array. My change from 32 to 64 was speculative: because the range of possible buckets now goes from 0 to $2^{64} - 1$, I guessed that the 32 should become 64. It was pure luck that this change got rid of the out-of-bounds access.

What is this function doing?
Why is the constant 2654435769U used?
Why do we have to right-shift it by 32 (now 64) minus the number of bits used to compute the number of buckets?

Thanks for any guidance you could provide.


I should have mentioned that khashl is not intended for holding >4 billion elements, mainly because, in my view, an ensemble of hash tables will be overall better.

For __kh_h2b(), see this post. 2654435769 is approximately $2^{32}/1.618$, where 1.618 approximates the golden ratio. For 64 bits, the constant should be 11400714819323198485UL. This function guards against bad hash functions. If you know you have a good hash function, you can simply use:

static kh_inline khint_t __kh_h2b(khint_t hash, khint_t bits)
{
    return hash >> (64 - bits);
}

This will save a little time. In practice, though, I feel it is safer for a library to have this guard.

I should have mentioned that khashl is not intended for holding >4 billion elements mainly because in my view, an ensemble of hash tables will be overall better.
I see, makes sense. Although trying to make it fit so much data has been a great exercise in understanding the codebase and hash tables in general.
Thanks a lot for the help and the excellent read.