cycollins / ibim

C++ implementation of interpolation-based index management hashing, implemented as a re-usable data structure with variadic templates

ibim

This is the beginnings of a fairly general-purpose implementation of Walter Burkhard's Interpolation-Based Index Management (IBIM) hash technique. It builds on his work with linear hashing, which requires a function H that maps the key space (as uniformly as possible) into the real interval [0, 1) while preserving order: for any keys k1 and k2, k1 < k2 => H(k1) < H(k2). H feeds into the final hash function h, which generates the table index. The table size is always a power of two and can burst and contract based on utilization.

One way to generate the table index would be to simply multiply H(k) by the table size, but this would be very prone to statistical "clumping" in the data. For example, a key space consisting of first names would cause collisions among the fairly common "j" names, like "james", "john", and "jenny", and would be under-utilized around rare names like "xander" and "xavier". To mitigate this, Burkhard introduces the bit-reversal permutation B, which takes an integer in the range [0, 2^n) and maps it to the integer whose n-bit pattern is reversed: bit n-1 is swapped with bit 0, bit n-2 with bit 1, and so on. For a table of size 16 (n = 4) and an input of 13 (0b1101), B(13, 4) = 11 (0b1011).

When we multiply H(k) by table size 2^n, we get a number in which the index is heavily influenced by the key's lexical order; the common "j" names, with a function H that treats strings as base-26 fractions, will tend to map to a value around 5 or 6 for a table of size 2^4 (9/26 * 16). After the bit-reversal permutation is applied, the resulting value is instead heavily influenced by what used to be the lowest-precision bits, which have much more entropy and appear to vary arbitrarily with respect to the lexical sort order of the input key. Hence, for k = "john", H(k) ~= 0.3673, n = 4, floor(H(k) * (1 << n)) = 5, and the final hash h(k, n) = B(floor(H(k) * (1 << n)), n) = B(5, 4) = 10 (0b1010). Because of the bit-reversal permutation, h("karl", 4) = 6, which does not collide with h("john", 4) = 10; the two indices are not particularly close, despite the fact that the keys are very near one another in lexical sort order.

It is still true that, because there are lots of "j" names, things will tend to collide into the bucket with index 10 for n = 4: h("john", 4) = 10, h("jose", 4) = 10, h("joan", 4) = 10, and so forth. However, this grouping, due to the extreme proximity of these names in lexical sort order, will cause the bucket capacity at index 10 to quickly rise above some acceptable threshold, which in turn will cause the table to burst to n = 5. Now h(k, n) for the above names will spread. As n increases, the low-order bits of the integer representation of H(k) begin to diverge, which breaks up the "clumping" around these names in the indices produced by h(k, n), so the table is somewhat self-regulating. It should be noted, though, that in this particular example it isn't until n = 7 that h("joan", n) differs from h("john", n) and h("jason", n), and it isn't until n = 11 that they all get different indices. An extremely bad statistical spread in the key space can still affect the efficiency of the data structure, even with the bit-reversal permutation. But it remains true that as n increases, the high-density regions of the key space are broken up into what looks like a random distribution. Thus, some care should be taken to make sure that H(k) spreads your data as evenly as possible within the interval [0, 1).
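To make the pipeline h(k, n) = B(floor(H(k) * 2^n), n) concrete, here is a minimal sketch, assuming lowercase ASCII keys and a base-26 H. It is not this library's actual interface, just a standalone illustration that reproduces the "john" and "karl" indices worked out above.

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>
#include <string>

// Order-preserving key-to-[0, 1) map for lowercase ASCII keys, treating the
// string as a base-26 fraction. (Illustrative only; the library accepts a
// user-supplied H per key type.)
double H(const std::string& key)
{
  double value = 0.0;
  double scale = 1.0 / 26.0;
  for (char c : key)
  {
    value += (c - 'a') * scale;
    scale /= 26.0;
  }
  return value;
}

// Bit-reversal permutation over n-bit integers: bit 0 <-> bit n-1, etc.
std::uint32_t B(std::uint32_t index, int n)
{
  std::uint32_t reversed = 0;
  for (int bit = 0; bit < n; ++bit)
  {
    reversed = (reversed << 1) | (index & 1);
    index >>= 1;
  }
  return reversed;
}

// Final hash for a table of size 2^n.
std::uint32_t h(const std::string& key, int n)
{
  auto raw = static_cast<std::uint32_t>(std::floor(H(key) * (1u << n)));
  return B(raw, n);
}

int main()
{
  std::cout << h("john", 4) << '\n';  // 10, as in the walkthrough above
  std::cout << h("karl", 4) << '\n';  // 6
}
```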

So with all this talk of complications and mitigation, why would one bother to use this hashing technique? After all, it's easier to get close-to-ideal behavior with a traditional hash function that endeavors to obscure any predictable relationship between the value of a key and the table index it generates. What you get in return for preserving some information about lexical order in the hash table index is inexpensive range queries. The slight complication with the statistical distribution of key values (which can be handled in many different ways) buys a scheme whereby the buckets holding every record with a key in a user-defined interval can be enumerated directly. All you have to do is generate the indices for the endpoints of the interval, and you can easily derive the indices of the buckets holding records for the intermediate keys. You only have to filter the endpoint buckets (some entries in the endpoint buckets may have keys outside the specified query interval); the remaining buckets will contain only records with keys within the interval. Even more interesting, Burkhard shows how this can be extended to hashing with multiple keys, and to performing multi-dimensional range queries. The implementation will demonstrate how this is done, and the test data will attempt to experimentally characterize the run-time performance of these techniques. Iterating through the results of a range query is roughly O(n) in the number of records in the table, but the enumeration of the buckets and records is independent of the cost of a comparison operation.
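For the single-key case, the following sketch (illustrative names, not this library's API) reuses H and B from the sketch above to enumerate the buckets a range query must visit; only the endpoint buckets need per-record filtering.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Declared in the previous sketch: order-preserving key map and bit reversal.
double H(const std::string& key);
std::uint32_t B(std::uint32_t index, int n);

// Enumerate the buckets a range query [low_key, high_key] must visit in a
// table of size 2^n.
std::vector<std::uint32_t> range_query_buckets(const std::string& low_key,
                                               const std::string& high_key,
                                               int n)
{
  const auto low_raw = static_cast<std::uint32_t>(H(low_key) * (1u << n));
  const auto high_raw = static_cast<std::uint32_t>(H(high_key) * (1u << n));

  std::vector<std::uint32_t> buckets;
  for (std::uint32_t raw = low_raw; raw <= high_raw; ++raw)
  {
    // Because H preserves key order, every record with a key in the interval
    // lands in one of these buckets. Only the buckets for low_raw and high_raw
    // can also contain out-of-range keys and need per-record filtering; the
    // interior buckets can be returned wholesale.
    buckets.push_back(B(raw, n));
  }
  return buckets;
}
```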

This implementation is also a playground for sharpening my variadic template skills, since that is how the data structure is described. To begin with, it lives in an Xcode project, but the code is generic C++17, modulo the quirks of clang's implementation ("typename" seems more mandatory than in the documentation I've seen). It doesn't use any bleeding-edge language features, aside from one place I can think of where I use a fold expression and quite a few places where "if constexpr" is used. If this project has any broad appeal, though, it will probably just be as a starting point for some implementation ideas around IBIM, or as an example (for good or ill) of variadic template use.
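For readers unfamiliar with these features, here is a hypothetical fragment, not code from this repository, showing the kind of variadic declaration, fold expression, and "if constexpr" use referred to above.

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical shape of a multi-key table declaration; the actual class and
// member names in this repository may differ.
template <typename Value, typename... Keys>
class multi_key_table
{
public:
  static constexpr std::size_t key_count = sizeof...(Keys);

  // Fold expression: true only if every key type fits in a pointer-sized slot.
  static constexpr bool small_keys = ((sizeof(Keys) <= sizeof(void*)) && ...);

  // "if constexpr" selects a strategy at compile time based on the key pack.
  static constexpr const char* storage_note()
  {
    if constexpr (small_keys)
      return "keys stored by value in each record";
    else
      return "larger keys may be stored indirectly";
  }

private:
  struct record
  {
    std::tuple<Keys...> keys;  // each record carries its full key tuple
    Value value;
  };
};

// e.g. a two-key table keyed by (width, height):
// multi_key_table<TextureRegion, int, int> by_dimensions;
```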

On a personal note, Dr Burkhard was a good professor and a nice guy. A couple of decades after graduating, I thought perhaps I had a problem for which IBIM would be a good solution, but I had trouble remembering some of the implementation details. I found that Dr Burkhard was still on the faculty at UCSD, where I had attended his advanced data-structures course. I sent him an email explaining that I would like to use the technique, and he replied with some very helpful clarification and mailed me a copy of the paper he published describing the technique (I can't find a copy on-line, but I will post a link here if I can wrangle it eventually). The problem was a classic computational geometry exercise of packing rectangles within a rectangle, here applied to the practical computer graphics problem of packing small textures into a larger, on-card surface to reduce context changes at render time. I thought that by hashing the candidate sub-regions by height and width, and then using range queries to find close-to-ideal fits for candidate sub-textures, I might be able to get a more efficient packing than with some more naive approach. In a follow-on github project, I plan to revisit this exercise using this library.

License: MIT License

