k-kashapov / HashTable

Hash table implementation. Uses AVX2 to work faster

Abstract

This is an implementation of a hash table. The table is used to study the collisions and performance of hashing algorithms. AVX instructions are used for a performance boost.

Usage

The hash table itself does not provide any hashing functions; a few of them are implemented in Hashing.cpp. The table expects a hashing function with the signature int64_t Hash (const void *key, int key_len).

We have a typedef for this function:

typedef int64_t (* HFunc_t) (const void *, int);

Init the table with Hash_t table = {};

Then use a constructor to prepare it properly:

CreateTable (&table, table_len);

Use TableInsert (Hash_t *table, type_t value, HFunc_t Hash) to add elements.

Use TableDelete (Hash_t *target_table, const void *key, int key_len, HFunc_t UserHash) to delete elements from the table.

Use DestrTable (Hash_t *table, int (*elemDtor) (void *)) to destroy the table. The function passed as an argument is a destructor for the table elements, in case they are some kind of data structure.
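
Putting it together, a minimal usage sketch. The header names, the type_t layout, and the hash function name are assumptions for illustration, not taken verbatim from the repository:

#include "HashTable.h"   // assumed header name
#include "Hashing.h"     // assumed header name

int main ()
{
    Hash_t table = {};
    CreateTable (&table, 1024);                // table_len, ideally a power of two

    // type_t is assumed to be { data, key, key_len }, as used later in this README
    type_t value = { 0, "melkor", 6 };
    TableInsert (&table, value, MurmurHash);   // any HFunc_t from Hashing.cpp

    TableDelete (&table, "melkor", 6, MurmurHash);

    DestrTable (&table, NULL);                 // NULL: assumed to be allowed when elements need no destructor
    return 0;
}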

Uniformity test

The first part of the task was to check the uniformity of several hash functions:

  1. Hash = String length
  2. Hash = First symbol of the string
  3. Sum of symbols' ASCII codes
  4. ROL hash: Hash[i + 1] = (Hash[i] ROL 1) ^ String[i + 1] (see the sketch after this list)
  5. MurmurHash2A
  6. CRC32 Hash
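
A possible implementation of the ROL hash from item 4 (a sketch matching the HFunc_t signature above, not necessarily the exact code in Hashing.cpp):

#include <stdint.h>

int64_t RolHash (const void *key, int key_len)
{
    const uint8_t *bytes = (const uint8_t *) key;
    uint64_t hash = 0;

    for (int i = 0; i < key_len; i++)
    {
        hash = (hash << 1) | (hash >> 63);   // ROL 1 on a 64-bit hash
        hash ^= bytes[i];
    }

    return (int64_t) hash;
}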

To test the functions, we hashed the entire Silmarillion by J.R.R. Tolkien word by word and plotted diagrams of hash collisions. The number of collisions for a hash value H is the length of the list associated with H. A chi-squared test (more on that later) was performed to estimate the uniformity quantitatively.

Experimental results

1) String length hash

Max collisions = 1594 at length = 6

2) First symbol hash

Max collisions = 717 at letter S

3) Sum of symbols hash

Max collisions = 50

4) ROL hash

Max collisions = 21

5) Murmur hash

Max collisions = 13

6) CRC32 hash

Max collisions = 13

Chi-squared test

The chi-squared test allows us to get a characteristic value of uniformity. Values between 0.95 and 1.05 indicate a highly uniform distribution.

The formula used:

$$\chi^2 = \frac{\sum_{j=0}^{m-1} b_j \, (b_j + 1) / 2}{(n / 2m)\,(n + 2m - 1)}$$

Where

  • m - number of possible hash values (table buckets)
  • n - total number of hashed keys (words)
  • b[j] - number of collisions for hash value j, i.e. the length of its chain
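
A small sketch of how this estimate can be computed from the chain lengths (illustration only, not the repository's code):

#include <stddef.h>

// b[j] is the chain length for hash value j; m is the number of buckets
double HashUniformity (const size_t *b, size_t m)
{
    double n = 0, sum = 0;

    for (size_t j = 0; j < m; j++)
    {
        n   += (double) b[j];
        sum += (double) b[j] * (b[j] + 1) / 2.0;
    }

    // Expected value of the numerator for a perfectly uniform hash
    double expected = (n / (2.0 * m)) * (n + 2.0 * m - 1.0);

    return sum / expected;   // values close to 1.0 mean close to uniform
}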

Results

CRC32 and MurmurHash are the most uniform hashes with values 1.06 and 1.07 respectively.

That concludes our research.

Optimization history

  • We want our HashTable to be used to search elements by key in long texts (more than 10000 words). As a result, the stress test was the following (sketched after this list):
  1. Load the whole Silmarillion by J.R.R. Tolkien into the hash table
  2. For each word of the book, call TableFind 512 times
  3. Erase the whole book from the table word by word
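
A sketch of this stress test. TableFind's signature is assumed to mirror TableDelete's, and the word arrays are assumed to be prepared by the caller:

void StressTest (Hash_t *table, char **words, int *lens, int num_words, HFunc_t Hash)
{
    for (int i = 0; i < num_words; i++)             // 1. load the whole book
        TableInsert (table, type_t { 0, words[i], lens[i] }, Hash);

    for (int i = 0; i < num_words; i++)             // 2. find every word 512 times
        for (int rep = 0; rep < 512; rep++)
            TableFind (table, words[i], lens[i], Hash);

    for (int i = 0; i < num_words; i++)             // 3. erase the book word by word
        TableDelete (table, words[i], lens[i], Hash);
}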

Performance tests were conducted using the perf tool and Linux's time utility. We optimize both the number of cycles spent in each function and the overall execution time. Optimization flag: -O2.

In every measurement:

  • Period - approximate total number of CPU cycles spent in the function. Output of perf.
  • Exec. Time - total time required to pass the stress test. Output of time.

TableFind optimization

  • Judging by the perf output, the slowest function was TableFind itself, as it performs a lot of safety checks at runtime, so we decided to optimize it first.

We have found that the prologue of the TableFind function takes significant time, so the first step is to make this function inline.
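
Roughly, the inlined lookup looks like this (a sketch with an assumed signature, the List type name assumed as well, and the runtime safety checks omitted; GetElemByHash and ListFind appear later in this README):

static inline void *TableFind (Hash_t *table, const void *key, int key_len, HFunc_t Hash)
{
    int64_t hash = Hash (key, key_len);

    // Bucket (chaining list) that corresponds to this hash value
    List *list = (List *) GetElemByHash (table, hash);

    long res_elem = ListFind (list, type_t { 0, key, key_len });

    return res_elem ? GET_LIST_DATA (list, res_elem) : NULL;
}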

Inlining the function gave a slight performance boost, but it also removed the function from the top of the callgrind output. Performance of the main function:

| Inline? | Period, Bil. cycles | Exec. Time, s |
|---------|---------------------|---------------|
| No      | 5.6                 | 3.31 ± 0.05   |
| Yes     | 4.4                 | 3.03 ± 0.05   |

After inlining, the CPU cycles spent in TableFind show up as the change in cycles for StressTest, because the two now form a single function.

StressTest optimization

  • The heaviest function is StressTest, as it now contains TableFind. Using perf, we find that the most expensive part of it is GetElemByHash, because it is called whenever a key is processed.

Our solution is to replace it with a macro.

void *GetElemByHash (Hash_t *target_table, int64_t hash)
{
    // Capacity is a power of two, so masking is equivalent to hash % capacity
    int64_t capacity_mask = target_table->capacity - 1;

    void *target_elem = target_table->Data[hash & capacity_mask];

    return target_elem;
}

Was replaced with

#define GET_ELEM_BY_HASH(tbl_, hash_) (tbl_->Data[hash_ & (tbl_->capacity - 1)])

However, this gives only a very small performance boost:

| Macro | Period, Bil. cycles | Exec. Time, s |
|-------|---------------------|---------------|
| No    | 4.4                 | 2.90 ± 0.08   |
| Yes   | 4.8                 | 2.87 ± 0.05   |

Note: although the total number of cycles increased, the execution time benefited from the optimization.

At this point we cannot find any further improvements to speed up the TableFind function, so we will now try to optimize the second heaviest function: Hash.

Hash optimization

  • The next function to optimize was the Hash function.

We have tried to improve execution time by rewriting MurmurHash in assembly. However, this only reduced the performance of the program:

| Assembly | Period, Mil. cycles | Exec. Time, s |
|----------|---------------------|---------------|
| No       | 4.8                 | 2.87 ± 0.08   |
| Yes      | 6.2                 | 3.45 ± 0.01   |

We then replaced MurmurHash with a hash based on CRC32 intrinsics (the parallel CRC32 mentioned in the summary below):

| Intrinsics hash | Period, Mil. cycles | Exec. Time, s |
|-----------------|---------------------|---------------|
| No              | 2.4                 | 2.87 ± 0.08   |
| Yes             | 1.2                 | 2.72 ± 0.01   |

The intrinsics hash gave a 6% performance boost.

Such a small gain can be explained by the fact that the hash function is not the bottleneck.
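
A minimal sketch of such a CRC32 hash built on the SSE4.2 intrinsics _mm_crc32_u64 and _mm_crc32_u8 (compile with -msse4.2); it is an illustration, not the exact code from Hashing.cpp:

#include <nmmintrin.h>
#include <stdint.h>
#include <string.h>

int64_t Crc32Hash (const void *key, int key_len)
{
    const uint8_t *bytes = (const uint8_t *) key;
    uint64_t crc = 0;

    // Process 8 bytes per instruction, then finish the tail byte by byte
    int i = 0;
    for (; i + 8 <= key_len; i += 8)
    {
        uint64_t chunk = 0;
        memcpy (&chunk, bytes + i, 8);
        crc = _mm_crc32_u64 (crc, chunk);
    }
    for (; i < key_len; i++)
        crc = _mm_crc32_u8 ((uint32_t) crc, bytes[i]);

    return (int64_t) crc;
}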

StrCmp optimizations

From the perf output we can see that the most time-consuming part of the ListFind function is strcmp.

The zeroth step is to replace strcmp with memcmp, since we already know the length of each string.

| Comparator | Exec. Time, s |
|------------|---------------|
| strcmp     | 2.72 ± 0.01   |
| memcmp     | 2.60 ± 0.02   |

Now we replace memcmp for short words with AVX instructions that compare multiple bytes at once.
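
A minimal sketch of such a comparison, assuming the keys are stored in zero-padded 32-byte buffers (an assumption about the layout, not the repository's exact code); compile with -mavx2:

#include <immintrin.h>
#include <stdint.h>

// Returns 1 if the two 32-byte blocks are identical, 0 otherwise
static inline int AvxCmp32 (const void *lhs, const void *rhs)
{
    __m256i a = _mm256_loadu_si256 ((const __m256i *) lhs);
    __m256i b = _mm256_loadu_si256 ((const __m256i *) rhs);

    // Byte-wise equality; movemask packs the 32 results into one bit each
    __m256i  eq   = _mm256_cmpeq_epi8 (a, b);
    uint32_t mask = (uint32_t) _mm256_movemask_epi8 (eq);

    return mask == 0xFFFFFFFFu;
}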

The results are surprisingly good! Performance has been improved by almost 30%:

| Intrinsics memcmp | Period, Mil. cycles | Exec. Time, s |
|-------------------|---------------------|---------------|
| No                | 2.1                 | 2.60 ± 0.02   |
| Yes               | 2.9                 | 2.12 ± 0.02   |

Note: once again, the CPU cycle count does not correlate with the overall performance trend.

Inline ASM optimization

As part of our course, we were recommended to try the inline __asm__ feature. It does not improve the execution speed and is kept for educational purposes only.

if (target_list->size > 1)
{
    long res_elem = ListFind (target_list, type_t { 0, key, key_len });
            
    if (res_elem) found = GET_LIST_DATA (target_list, res_elem);
}

Has been replaced with

__asm__ ("cmp $1, %0\n"
         "jle EmptyList\n"
         :: "r" (target_list->size));

            long res_elem = ListFind (target_list, type_t { 0, key, key_len });
            if (res_elem) found = GET_LIST_DATA (target_list, res_elem);
            
__asm__ ("EmptyList:\n" ::);
Inline ASM Exec. Time, s
NO 3.50 ± 0.2
YES 3.57 ± 0.2

This was the last optimization so far. Let us sum up.

Optimization summary

  • perf killed my processor several times. It was very scary...

  • Inlining a function gave us a barely noticeable 2% performance boost.

  • Rewriting the Hash function in assembly decreased the computation speed.

  • Replacing Murmur Hash with parallel CRC32 made our program 6% faster.

  • Finally, implementing a parallel memcmp made it almost 1.3 times faster.

  • Overall speedup: 1.56x

Optimization coefficient

As part of our course, we were recommended to compute the following value:

  • 1.56 times performance boost / 113 lines of assembly and SIMD code × 1000 ≈ 13.8

Acknowledgements

Special thanks to Futherus, Denchik and Vasilmao for reviewing my README file. I would also like to express my gratitude to the entire development team of perf.
