A python module for Merkle tree inspired operations (flexibility over efficiency).
The Merkle tree has been around since the 1970s and is useful for all sorts of marvelous things. A couple notable examples of their application within the free software world:
- The git version control system data model uses a form of Merkle tree to efficiently represent a repository as a graph of discrete states. The deterministic identifiers of each state enables easy, efficient, straightforward branch management, ultimately leading to the possibility of distributed version control.
- The cassandra distributed database uses Merkle trees to detect inconsistencies between nodes.
The merky
module provides tools for calculating the components of a hash tree, initially intended
to provide a sort of "git for data structures". It is not adhering to a particular formal definition
of a tree or a particular data structure form; it works with arbitrary data structures and gives the
user control over how the data structures are transformed.
When merky
transforms a data structure, it performs a depth-first walk over the structure.
For a given nested data structure, merky
:
- Calculates a canonical form of that data structure.
- Serializes that canonical form (currently using UTF-8 JSON serialization).
- Calculates a hash for that serialized form (currenly using the SHA1 hexdigest).
- Yields the
(hash, canonical)
pair to the caller. - Substitutes the hash for the original structure within its containing structure for the remainder of the calculation.
Tokenization is the basic function of merky
.
The hash calculation for a given data structure needs to be deterministic, which in turn requires that the serialization of structures be deterministic.
Data structures like python's dict
do not guarantee a deterministic ordering; merky
therefore
must provide such guarantees itself. The default rules:
- dict-like things are sorted and converted to the standard library's
OrderedDict
. - list-like things are converted to a list.
- everything else passes through unaltered.
Merky
relies on duck typing rather than inspecting types.
- If it has an
encode
attribute, assume it's a string. (This is important to distinguish between strings and sequences). - If it has an iterator over key/value pairs (
six.iteritems
), assume it's a dict. - If it supports iteration (
iter(foo)
), assume it's a list.
Code leaves less room for interpretive error than does prose.
import merky
import six
structure = {
"first": ("a", "b", "c"),
"second": {"first": "1st!", "second": "2nd!"}
}
t = merky.Transformer().transform(structure)
# Depth-first, sorted key order; the ("a", "b", "c") is tokenized first.
# Note that the tuple is canonicalized as a list.
# ('e13460afb1e68af030bb9bee8344c274494661fa', ['a', 'b', 'c'])
six.print_(next(t))
# According to sorted key order, the {"first": "1st!"...} dict is tokenized second.
# Note that the dict is canonicalized as an OrderedDict.
# ('555cf5554cbd46144bd01851ebb278d32d4dc538', OrderedDict([('first', '1st!'), ('second', '2nd!')]))
six.print_(next(t))
# The outer container is last. Note that substructures have been replaced by
# their SHA1 digests. They've been "tokenized".
# ('4c928a93cd9af338c722acfdc8daf09d186e621f', OrderedDict([('first', 'e13460afb1e68af030bb9bee8344c274494661fa'), ('second', '555cf5554cbd46144bd01851ebb278d32d4dc538')]))
six.print_(next(t))
# Reached the end; raises StopIteration
next(t)
While the basic merky.Transformer
will tokenize all dict-like and sequence-like
structures, from bottom to top, this isn't necessarily what you want for a given
use case.
The annotation interface of merky
gives the user control over the tokenization of
a complex data structure. In effect, the user annotates the data structure to instruct
merky
on which portions should be tokenized.
Using the merky.AnnotationTransformer
:
- All structures convert to their canonical forms.
- The top-level structure passed to
transform()
is always tokenized. - Nested structures are only tokenized if they have a
__merky__
attribute with a truthy value.
The merky.annotate
helper can wrap any arbitrary structure such that the
__merky__
truth requirement is met, causing the AnnotationTransformer
to tokenize
that wrapped structure.
Riffing further on our earlier transformation example, suppose now that we use
annotation. While the top structure passed to transform()
will always be
tokenized, only the annotated contents are subject to tokenization.
import merky
import six
# The inner tuple is annotated, but the inner dict is not.
with_list = {
"first": merky.annotate(("a", "b", "c")),
"second": {"first": "1st!", "second": "2nd!"}
}
# The inner dict is annotated, but the inner tuple is not.
with_dict = {
"first": ("a", "b", "c"),
"second": merky.annotate({"first": "1st!", "second": "2nd!"})
}
transformer = merky.AnnotationTransformer()
# Here we expect the tuple to be tokenized and the sibling dict to not be.
t = transformer.transform(with_list)
# The tuple, canonicalized to list as before. Note the identical hash value.
# ('e13460afb1e68af030bb9bee8344c274494661fa', ['a', 'b', 'c']
six.print_(next(t))
# The top dict, w/ tuple replaced by token but inner dict merely canonicalized.
# ('cfbe067935615515ba69aecc60c1cdd64e1cdc5d', OrderedDict([('first', 'e13460afb1e68af030bb9bee8344c274494661fa'), ('second', OrderedDict([('first', '1st!'), ('second', '2nd!')]))]))
six.print_(next(t))
# Here we expect the tuple to be canonicalized and the sibling dict to be tokenized.
t = transformer.transform(with_dict)
# The inner dict, canonicalized to OrderedDict as in the original example.
# ('555cf5554cbd46144bd01851ebb278d32d4dc538', OrderedDict([('first', '1st!'), ('second', '2nd!')]))
six.print_(next(t))
# The top dict, w/ dict replaced by token but inner tuple merely canonicalized.
# ('3f9970218863154d777983ca60bfd3d3318462fa', OrderedDict([('first', ['a', 'b', 'c']), ('second', '555cf5554cbd46144bd01851ebb278d32d4dc538')]))
six.print_(next(t))
Thus the user can control the manner in which a given data structure is transformed.
Suppose you have a hierarchy of things, where each node has (in addition to its children) attributes of its own.
The merky.cases.attrgraph.AttributeGraph
class provides a data structure for this that, in combination with the
merky.AnnotationTransformer
:
- Defines the tokenized form of a node as the combination of its attributes token and its members/children token.
- Defines its attributes token as the annotated transformation of an
attrs
dictionary (the dictionary will be canonicalized but its members are not). - Defines its members token as the annotated transformation of a
members
dictionary, the values of which are each annotated as well.
If a user uses the AttributeGraph
for all nodes in the hierarchy, then AttributeGraph
via the
AnnotationTransformer
can go back and forth between tokenized form and object form.
The separation of node attributes from membership (graph edges) within the structure means that:
- A child's tokenized form is unaffected by the properties of its parent or container.
- The parent or container's tokenized form is affected by the properties of its members/children.
See merky.cases.attrgraph.AttributeGraph
for more.
The merky.cases.tokendict.TokenDict
class provides what is probably the most obvious basic
use of merky
functionality, and perhaps one of the more natural data structures when thinking
of a hash tree. Namely: a dictionary that ensures its values will be annotated when used with
the merky.AnnotationTransformer
.
Some example uses of this:
- A traditional hash tree, where you want to identify which portions of two dicts are different without exchanging/comparing the entire data structures.
- A keyed collection of objects that you want to ensure receive tokenization, where the keys serve merely as identifiers and do not convey any additional meaning.
- A version history for a given structure, in which each key represents a version marker (a tag, a timestamp, a simple incremented number, whatever) and the value is the state for the structure as of that version.
See merky.cases.tokendict.TokenDict
for more.
You can explore a transformed structure to walk the tokens and associated structures
as a graph using the merky.cases.walker.Walker
class. Refer to its docs for more.
Complex storage options are outside the scope of merky
. However, for convenience, `merky
does provide some minimal storage options that can serve for some cases.
An arbitrary transformed structure can be stored/restored using various classes found in the
merky.store.structure
module. For instance:
# Suppose this gives me some `merky.cases.attrgraph.AttributeGraph` instance.
g = get_my_graph()
# Now I'm going to store it a file.
transformed = merky.AnnotationTransformer().transform(g)
# Get a JSON file writer for the path 'my-graph.json'
store = merky.store.structure.JSONFileWriteStructure('my-graph.json')
# Populate the writer with the transformed data.
store.populate(transformed)
# Write the file.
store.close()
Now later you want to restore it. Like so:
# Get the reader.
store = merky.store.structure.JSONFileReadStructure('my-graph.json')
# And away we go.
g = merky.cases.attrgraph.AttributeGraph.from_token(store.head, store.get)
For a more complex example, see merky.test.misc.graph_storage_test
, which
assembles a TokenDict
of AttributeGraphs
, writes to JSON, then restores from JSON.
It's an anachronistic absurdity that needs to be abolished. Direct your respect elsewhere.
Even more so.