lleo / go-hamt

Go implementation of a Hash Array Map Trie

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

V2 API

This package was transient only, now it has a functional mode as well. Transient is the classical style of modifying data structures in place. Functional is defined below as immutable & persistent.

This is a merger with github.com/lleo/go-hamt-functional, which it obsoletes. The motivation for this was because we were seeing slower performance for go-hamt-functional Get() operations even though we were certain that the old go-hamt and go-hamt-functional were using the same algorithm. This merger guarantees that the transient and functional Hamt implementations are using the exact same internal data structures. This is true even to the degree that we can recast a HamtTransient data structure to HamtFunctional and the code will switch from transient (modify in place) to functional (copy on write) behavior. Of course, this works the other way around as well (that is, we can cast a HamtFunctional to HamtTransient).

This package also obsoletes github.com/lleo/go-hamt-key because we pass a []byte slice to Get/Put/Del operations instead of a Key data structure. What happens is we use the []byte slice to build a Key data structure to be used internally. This results in a simpler API and no external dependency.

What is a HAMT?

HAMT stands for Hash Array Mapped Trie. That spells it out clearly right? Ok, not so much. A HAMT is an in-memory Key/Value data structure with some really cool features. The first feature is that we implement it as a tree with a reasonably large and fixed branching factor; thereby making it a pretty flat tree. The killer feature is that the tree has a maximum depth, resulting in a O(1) Search and Modify speeds. That is O(1) without a giant constant factor.

HAMT makes this happen by first hashing the []byte slice into either a 32bit or 64bit hash value. We then split this hash value up into a fixed number of parts. Each part now becomes the index into the Array we use for interior nodes of the tree. Lastly, we only use enough parts of the hash value to find a free node to store the value into; this is why it is called a Trie. So now we have a wide tree with a maximum depth where we only use enough of the parts of hash value to find a free spot to store the leaf; That is what makes a HAMT O(1) and fast.

In our implementation, we use the FNV hashing algorithm. It is fast and provides good randomness and that is all we need from it.

Also we choose to split the 32bit or 64bit hash value into 5bit values. 5bit values mean the tree will have a branching factor of 32 and a maximum depth of 6 for a 32bit hash value (hamt32) and 12 for a 64bit hash value (hamt64). You may be noticing that 5 does not go into 32 nor 64 cleanly. That is not a problem because we fold the extra 2 or 4 bits into the main hash value. Don't worry this is a legitimate thing to do. In the (very) rare case of a hash collision we use a special leaf value for both colliding key/value pairs.

go-hamt

We implement HAMT data structure based on either a 32 bit or 64 bit hash value, hamt32 and hamt64 respectively.

Further we can have the HAMT data structure behave in one of two modes, transient or functional. Transient means we modify the data structures in-place. Functional means persistent, which requires we use a copy-on-write strategy. In other words we copy each data structure and modify the copy, then given the parent now needs to be modified we follow this copy-on-write strategy up to a brand new HamtFunctional data structure. As you can imagine Transient is faster than Functional. Suprisingly, the transient behavior is not that much faster than the persistent behavior, because the HAMT data structure is flat and wide.

However, you cannot easily share transient datastructures between threads safely; you would need to implement a locking strategy. Where with functional data structures you are guaranteed safety across threads.

Functional (aka Immutable & Persistent)

The term Functional really stands for two properties immutable and persistent. Immutable means the datastructure is never modified after construction. Persistent means that when you modify the original immutable datastructure you do so by creating a new copy of the base datastructure which shares all its unmodified parts with the original datastructure.

Imagine a hypothetical balanced binary tree datastructure with four leaves, two interior nodes, and a root node. If you change the forth leaf node, than a new fourth tree node is created, as well as its parent interior node and a new root node. The new tree starting at the new root node is a persistent modification of the original tree as it shares the first interior node and its two leaves.

          (root tree node)     (root tree node')
            /          \         /          \
           /  +---------\-------+            \
          /  /           \                    \
    (tree node 1)    (tree node 2)        (tree node 2')
        /  \            /  \                /   \
       /    \          / +--\--------------+     \
      /      \        / /    \                    \
 (Leaf 1) (Leaf 2) (Leaf 3) (Leaf 4)             (Leaf 4')

Given this approach to changing a tree, a tree with a wide branching factor would be relatively shallow. So the path from root to any given leaf would be short and the amount of shared content would be substantial.

About

Go implementation of a Hash Array Map Trie

License:Apache License 2.0


Languages

Language:Go 97.7%Language:Shell 2.3%