This is an implementation of a hash table. The table is used to study the collision and performance characteristics of hashing algorithms. AVX instructions are used for a performance boost.
The hash table itself does not provide any hashing functions; a few of them are implemented in Hashing.cpp. The table expects a hashing function with the signature

```cpp
int64_t Hash (const void *key, int key_len);
```

We have a typedef for this function:

```cpp
typedef int64_t (* HFunc_t) (const void *, int);
```
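As a sketch, any function with a matching signature can be passed through this typedef. The byte-sum hash below is illustrative only and is not one of the functions from Hashing.cpp:

```cpp
#include <cstdint>

typedef int64_t (* HFunc_t) (const void *, int);

// Illustrative only: a trivial byte-sum hash matching the HFunc_t
// signature. The real hash functions live in Hashing.cpp.
int64_t ByteSumHash (const void *key, int key_len)
{
    const unsigned char *bytes = (const unsigned char *) key;

    int64_t hash = 0;
    for (int i = 0; i < key_len; i++)
        hash += bytes[i];

    return hash;
}
```

A table function would then receive it as, for example, `TableInsert (&table, value, ByteSumHash)`.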
Initialize the table with

```cpp
Hash_t table = {};
```

Then use the constructor to prepare it properly:

```cpp
CreateTable (&table, table_len);
```

Use `TableInsert (Hash_t *table, type_t value, HFunc_t Hash)` to add elements.

Use `TableDelete (Hash_t *target_table, const void *key, int key_len, HFunc_t UserHash)` to delete elements from the table.

Use `DestrTable (Hash_t *table, int (*elemDtor) (void *))` to destroy the table. The function passed as the argument is a destructor for the table elements, in case they are themselves data structures.
The first part of the task was to check the uniformity of several hash functions:
- Hash = string length
- Hash = first symbol of the string
- Hash = sum of the symbols' ASCII codes
- ROL hash: `Hash[i + 1] = (Hash[i] ROL 1) ^ String[i + 1]`
- MurmurHash2A
- CRC32 hash
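The ROL hash recurrence above can be sketched as follows (assuming a 64-bit state initialized to zero; the exact width and seed in Hashing.cpp may differ):

```cpp
#include <cstdint>

// ROL hash: Hash[i + 1] = (Hash[i] ROL 1) ^ String[i + 1].
// Assumes a 64-bit state starting at zero; the project's version may differ.
uint64_t RolHash (const void *key, int key_len)
{
    const unsigned char *bytes = (const unsigned char *) key;

    uint64_t hash = 0;
    for (int i = 0; i < key_len; i++)
    {
        hash = (hash << 1) | (hash >> 63);  // rotate left by 1 bit
        hash ^= bytes[i];
    }

    return hash;
}
```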
To test the functions, we hashed the entire Silmarillion by J.R.R. Tolkien word by word.
Diagrams of hash collisions were plotted. The number of collisions for a hash value H equals the length of the list associated with H. A chi-squared test (more on that later) was performed to estimate the uniformity quantitatively.
Peak values from the collision diagrams, in the order of the hash functions above:

- String length: max collisions = 1594 at length = 6
- First symbol: max collisions = 717 at letter S
- ASCII sum: max value = 50
- ROL: max value = 21
- MurmurHash2A: max value = 13
- CRC32: max value = 13
The chi-squared test allows us to obtain a characteristic value of uniformity. Values between 0.95 and 1.05 indicate a highly uniform distribution.
The formula used:

$$\chi^2 = \frac{\sum_{j=0}^{m-1} b_j (b_j + 1) / 2}{(n / 2m)(n + 2m - 1)}$$

where:

- m - number of hash values (table buckets)
- n - number of hashed keys
- b[j] - number of collisions for hash value j
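As a sketch, this value can be computed from the per-bucket list lengths like so (assuming `b` holds the list length of each of the `m` buckets and `n` is the total number of keys):

```cpp
// Chi-squared uniformity estimate: sum of b_j (b_j + 1) / 2 over all
// buckets, divided by (n / 2m)(n + 2m - 1). Values near 1 indicate a
// uniform hash; this is an illustrative helper, not the project's code.
double ChiSquared (const int *b, int m, int n)
{
    double numerator = 0;
    for (int j = 0; j < m; j++)
        numerator += b[j] * (b[j] + 1) / 2.0;

    double denominator = ((double) n / (2.0 * m)) * (n + 2.0 * m - 1);
    return numerator / denominator;
}
```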
CRC32 and MurmurHash are the most uniform hashes, with values of 1.06 and 1.07 respectively.
That concludes our research.
- We want our hash table to be used to search for elements by key in long texts (more than 10,000 words). As a result, the stress test was the following:
  - Load the whole Silmarillion by J.R.R. Tolkien into the hash table
  - For each word of the book, call `TableFind` 512 times
  - Erase the whole book from the table word by word
Performance tests were conducted using the `perf` tool and Linux's `time`. The number of cycles a function executes and the overall execution time are the values being optimized. Optimization flag: `-O2`.
In every measurement:

- Period - approximate total number of CPU cycles spent in a function (output of `perf`).
- Exec. Time - total time required to pass the stress test (output of `time`).
- Judging by the `perf` output, the slowest function was `TableFind` itself, as it performs a lot of safety checks at runtime, so we decided to optimize it first. We found that the prologue of `TableFind` takes significant time, so the first step was to make this function inline.
Inlining the function gave a slight performance boost and removed the function from the top of the callgrind output. Performance of the `main` function:
Inline? | Period, Bil. cycles | Exec. Time, s |
---|---|---|
NO | 5.6 | 3.31 ± 0.05 |
YES | 4.4 | 3.03 ± 0.05 |
After inlining, CPU cycles for `TableFind` = Δ(cycles for `StressTest`), because they now form a single function.
- The heaviest function is `StressTest`, as it contains `TableFind`. Using `perf`, we find that the most expensive part of this function is `GetElemByHash`, because it is called whenever a key is processed. Our solution is to replace it with a macro.
```cpp
void *GetElemByHash (Hash_t *target_table, int64_t hash)
{
    int64_t capacity_mask = target_table->capacity - 1;
    void   *target_elem   = target_table->Data[hash & capacity_mask];

    return target_elem;
}
```

was replaced with

```cpp
// Arguments are parenthesized so that arbitrary expressions can be passed safely
#define GET_ELEM_BY_HASH(tbl_, hash_) ((tbl_)->Data[(hash_) & ((tbl_)->capacity - 1)])
```
However, this gave very little performance boost.
Macro | Period, Bil. cycles | Exec. Time, s |
---|---|---|
NO | 4.4 | 2.90 ± 0.08 |
YES | 4.8 | 2.87 ± 0.05 |
Note: although the total number of cycles increased, the execution time benefited from the optimization.
At the current step, we cannot find any further improvements to speed up the `TableFind` function, so we will now try to optimize the second heaviest function: the hash.

- The next function to optimize was the Hash function. We tried to improve the execution time by rewriting MurmurHash in assembly language. However, this only reduced the performance of the program:
Assembly | Period, Bil. cycles | Exec. Time, s |
---|---|---|
NO | 4.8 | 2.87 ± 0.08 |
YES | 6.2 | 3.45 ± 0.01 |
- Another attempt at changing the hash function: use the hardware CRC32 intrinsics.
Intrinsics hash | Period, Bil. cycles | Exec. Time, s |
---|---|---|
NO | 2.4 | 2.87 ± 0.08 |
YES | 1.2 | 2.72 ± 0.01 |
The intrinsics hash gave a 6% performance boost. Such a small gain can be explained by the fact that the hash function is not the bottleneck.
From the following screenshot we can see that the most time-consuming part of the `ListFind` function is `strcmp`.

The zeroth step is to replace `strcmp` with `memcmp`, since we already have the length of each string.
Comparator | Exec. Time, s |
---|---|
strcmp | 2.72 ± 0.01 |
memcmp | 2.60 ± 0.02 |
Now we replace `memcmp` for short words with AVX instructions that compare multiple bytes at once. The results are surprisingly good: performance has improved by almost 30%:
Intrinsics memcmp | Period, Bil. cycles | Exec. Time, s |
---|---|---|
NO | 2.1 | 2.60 ± 0.02 |
YES | 2.9 | 2.12 ± 0.02 |
Note: once again, CPU cycles do not correlate with the overall performance trend.
As a part of our course, we were recommended to try the inline `__asm__` feature. This does not improve the execution speed; it is for educational purposes only.
```cpp
if (target_list->size > 1)
{
    long res_elem = ListFind (target_list, type_t { 0, key, key_len });
    if (res_elem) found = GET_LIST_DATA (target_list, res_elem);
}
```

has been replaced with

```cpp
// Note: a label defined in one asm block and jumped to from another is
// fragile: the compiler is free to reorder or duplicate the C++ code in
// between. Educational purposes only.
__asm__ ("cmp $1, %0\n"
         "jle EmptyList\n"
         :: "r" (target_list->size));

long res_elem = ListFind (target_list, type_t { 0, key, key_len });
if (res_elem) found = GET_LIST_DATA (target_list, res_elem);

__asm__ ("EmptyList:\n" ::);
```
Inline ASM | Exec. Time, s |
---|---|
NO | 3.50 ± 0.2 |
YES | 3.57 ± 0.2 |
This was the last optimization so far. Let us sum up.
- `perf` killed my processor several times. It was very scary...
- Inlining a function gave us a barely noticeable 2% performance boost.
- Changing the hash function, as well as implementing it in ASM, decreased the computation speed.
- Replacing MurmurHash with hardware CRC32 made our program 6% faster.
- Finally, implementing vectorized `memcmp` made it almost 1.3 times faster.
- Overall speedup: 1.56x
As a part of our course, we were recommended to compute the following value:

- 1.56 times performance boost / 113 lines of assembly and SIMD code × 1000 ≈ 13.8
Special thanks to Futherus, Denchik and Vasilmao for reviewing my README file. I would also like to express my gratitude to the entire development team of `perf`.