laforge49 / aatree

Immutable AA Tree

lazy deserialization/reserialization

laforge49 opened this issue · comments

When pulling data from a file, it is best to work with large blocks. But deserializing a large block is a real bottleneck, and all too often you only need to make a small change before reserializing the whole thing and writing it back to disk. This explains why it is so difficult to write a fast database in Java or Clojure.

Enter partial deserialization/reserialization. If we can deserialize just the root node of a tree, then we can access any node in a tree by deserializing log n nodes. This is a substantial improvement when only a small part of a block needs to be accessed/updated.
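
To make the log n claim concrete, here is a minimal sketch in Python (not the project's Clojure code). The byte layout and the names `serialize` and `LazyTree` are assumptions made up for illustration: each node stores its key, the byte length of its left subtree, and then the two subtrees, so a lookup can step over a whole subtree without decoding any of it.

```python
import struct

HEADER = struct.Struct(">iI")  # assumed layout: int32 key, uint32 left-subtree length


def serialize(keys):
    """Serialize a sorted list of ints as a balanced binary search tree."""
    if not keys:
        return b""
    mid = len(keys) // 2
    left = serialize(keys[:mid])
    right = serialize(keys[mid + 1:])
    return HEADER.pack(keys[mid], len(left)) + left + right


class LazyTree:
    """Looks keys up in the serialized block, decoding one header per visited node."""

    def __init__(self, data):
        self.view = memoryview(data)  # slicing a memoryview does not copy bytes
        self.decoded = 0              # number of node headers decoded so far

    def contains(self, key):
        view = self.view
        while len(view) > 0:
            node_key, left_len = HEADER.unpack_from(view, 0)
            self.decoded += 1
            if key == node_key:
                return True
            start = HEADER.size
            if key < node_key:
                view = view[start:start + left_len]  # descend left, skip right
            else:
                view = view[start + left_len:]       # descend right, skip left
        return False


block = serialize(list(range(65535)))  # 2**16 - 1 keys: a perfect tree of height 16
tree = LazyTree(block)
found = tree.contains(12345)
```

With 65,535 keys in the block, a single lookup in this sketch decodes at most 16 node headers; the rest of the block is never parsed.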

This is easily achieved by adding a wrapper for each node which has an atomic reference to its serialized and deserialized content. But before implementing this, we need to rework the existing code base (again). For one thing, we can move the map comparator out of the nodes and into AAMap, which allows us to unify map and vector nodes. For another, we need accessors for the content of each node, so that accessing the content of a wrapper forces the content to be deserialized as needed.
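
A rough sketch of such a wrapper, in Python rather than the project's Clojure. Everything here is an assumption for illustration: the class is only loosely modelled on the idea described above, a lock stands in for the atomic reference, and JSON stands in for whatever serialization format the nodes really use.

```python
import json
import threading

_MISSING = object()  # sentinel: "this form has not been computed yet"


class LazyNode:
    """Holds a node as serialized bytes, as a deserialized value, or as both."""

    def __init__(self, value=_MISSING, data=_MISSING):
        if value is _MISSING and data is _MISSING:
            raise ValueError("need a value or its serialized bytes")
        self._value = value
        self._data = data
        self._lock = threading.Lock()  # stands in for the atomic reference
        self.deserialize_count = 0     # instrumentation for the example only

    def value(self):
        """Accessor: deserializes on first use, then returns the cached value."""
        with self._lock:
            if self._value is _MISSING:
                self._value = json.loads(self._data.decode("utf-8"))
                self.deserialize_count += 1
            return self._value

    def data(self):
        """Returns the serialized form, serializing only if no bytes are cached."""
        with self._lock:
            if self._data is _MISSING:
                self._data = json.dumps(self._value).encode("utf-8")
            return self._data


untouched = LazyNode(data=b'{"level": 1, "value": 42}')
fresh = LazyNode(value={"level": 1, "value": 43})
```

The point of keeping both forms is that a node which is never read is never deserialized, and a node which is never changed can be written back using the bytes it already holds.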

The code base has now been reworked, unifying map and vector nodes and adding field accessors.

Lazy maps and vectors are now defined as well as LazyNode, which wraps Node. For now, LazyNode just holds a reference to a Node and provides indirect field access. That will change later.

What we need to do at this point is to define factories, which handle serialization/deserialization, and a factory registry. The factory registry will be passed as a resource to LazyNodes on instantiation.
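
One way a factory registry could look, sketched in Python. The `Factory` and `FactoryRegistry` classes, the one-byte tag, and the two example factories are all hypothetical; they are not the project's IFactory interface, only an illustration of looking up a serializer by type when writing and by tag when reading.

```python
import json


class Factory:
    """Knows how to write and read one kind of value (hypothetical interface)."""

    def __init__(self, tag, handles, write, read):
        self.tag = tag          # one-byte identifier stored in front of the payload
        self.handles = handles  # Python type this factory serializes
        self.write = write      # value -> bytes
        self.read = read        # bytes -> value


class FactoryRegistry:
    """Maps types to factories for writing and tags to factories for reading."""

    def __init__(self):
        self._by_tag = {}
        self._by_type = {}

    def register(self, factory):
        if len(factory.tag) != 1:
            raise ValueError("tag must be exactly one byte")
        self._by_tag[factory.tag] = factory
        self._by_type[factory.handles] = factory

    def serialize(self, value):
        factory = self._by_type[type(value)]
        return factory.tag + factory.write(value)

    def deserialize(self, data):
        factory = self._by_tag[bytes(data[:1])]
        return factory.read(bytes(data[1:]))


registry = FactoryRegistry()
registry.register(Factory(b"s", str,
                          lambda s: s.encode("utf-8"),
                          lambda b: b.decode("utf-8")))
registry.register(Factory(b"v", list,
                          lambda v: json.dumps(v).encode("utf-8"),
                          lambda b: json.loads(b)))
```

A lazy node given this registry on instantiation would only need the leading tag to find the right factory when its content is finally requested.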

The content of LazyNode has been fleshed out and the initial factory will be based on EDN. But serialization of nodes will be binary to allow for fast partial deserialization, and for that we will use byte buffers.

Time to define a resource that is passed on all node methods. For maps, the resource would contain the comparator. For lazy nodes, the resource would contain the registry of factories used for serialization/deserialization. And later, for virtual nodes the resource would contain access to the db file and a disk space manager. The resource is held by the map or vector.
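
A small Python sketch of the resource idea. The names `Resources`, `Node`, `insert` and `SortedMap` are invented for the example, and the tree is a plain unbalanced search tree with none of the AA tree balancing. What it shows is only the plumbing: nodes carry no comparator, the map holds the resource, and every node function receives it as a parameter.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Resources:
    comparator: Callable[[Any, Any], int]  # a registry of factories could live here too


@dataclass(frozen=True)
class Node:
    key: Any
    value: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key, value, resources):
    """Returns a new tree; the comparator comes from resources, not from the node."""
    if node is None:
        return Node(key, value)
    order = resources.comparator(key, node.key)
    if order < 0:
        return Node(node.key, node.value,
                    insert(node.left, key, value, resources), node.right)
    if order > 0:
        return Node(node.key, node.value,
                    node.left, insert(node.right, key, value, resources))
    return Node(key, value, node.left, node.right)


def keys_in_order(node):
    if node is None:
        return []
    return keys_in_order(node.left) + [node.key] + keys_in_order(node.right)


class SortedMap:
    """The map, not the nodes, holds the resource."""

    def __init__(self, resources, root=None):
        self.resources = resources
        self.root = root

    def assoc(self, key, value):
        return SortedMap(self.resources,
                         insert(self.root, key, value, self.resources))

    def keys(self):
        return keys_in_order(self.root)


ascending = Resources(lambda a, b: (a > b) - (a < b))
descending = Resources(lambda a, b: (b > a) - (b < a))
up = SortedMap(ascending).assoc(2, "b").assoc(1, "a").assoc(3, "c")
down = SortedMap(descending).assoc(2, "b").assoc(1, "a").assoc(3, "c")
```

Because the node type no longer depends on a comparator, the same `Node` could serve a vector as well, which is the unification mentioned earlier.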

Lazy deserialization/reserialization is working for vectors, so the next obvious thing is to get it working for maps. Only first, some cleanup. In particular, the byte-length method is way too slow, and that will matter even more once we start working on virtual nodes.

Byte length is now managed incrementally via atoms. Lazy maps are now working. And the resources parameter is passed to read-string for better control over deserialization of custom classes.
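
One possible reading of incremental byte length, sketched in Python. The length-prefixed entry format and the `SizedVector` class are assumptions for illustration and say nothing about how the atoms are actually used. The idea shown is that an update adjusts the cached length by the difference between the old and new entry instead of measuring the whole structure again.

```python
import struct


def entry_length(text):
    """Bytes used by one entry: a 4-byte length prefix plus the UTF-8 payload."""
    return 4 + len(text.encode("utf-8"))


def serialize(items):
    """Reference serializer, used here only to check the cached length."""
    parts = []
    for text in items:
        payload = text.encode("utf-8")
        parts.append(struct.pack(">I", len(payload)) + payload)
    return b"".join(parts)


class SizedVector:
    """Immutable vector of strings that carries its serialized length with it."""

    def __init__(self, items, byte_length=None):
        self.items = tuple(items)
        if byte_length is None:
            # Full computation happens only when no cached length is supplied.
            byte_length = sum(entry_length(text) for text in self.items)
        self.byte_length = byte_length

    def assoc(self, index, text):
        old = self.items[index]
        items = self.items[:index] + (text,) + self.items[index + 1:]
        # Incremental update: subtract the old entry's size, add the new one's.
        new_length = self.byte_length - entry_length(old) + entry_length(text)
        return SizedVector(items, new_length)


original = SizedVector(["a", "bb", "ccc"])
updated = original.assoc(1, "a longer entry")
```

In a tree, the same adjustment would presumably be applied to each node on the path from the changed entry up to the root, so the cost stays proportional to the depth of the tree.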

Still need to clean up some code duplication and then add support for nested lazy structures. But it looks like this is finally coming to completion. And yeah, need at least a simple timing test. It will be much slower than the Java code that was previously implemented, but still blazingly fast compared to anything else I've ever heard of. It is also very much smaller than the Java code, partly because it is written in Clojure but also because it leverages prn-str and read-string. And it is also a rewrite!

I've now minimized the length of the factory methods, which is important as I need to reify 4 more instances of the IFactory interface.

I am thinking that nesting is not so important and can be left to the next release. I've completed the timing test for updates:

Time to build a vector of size 1000000 = 2.7142E7 microseconds
Time per entry: 27.142 microseconds

Time to deserialize/update/reserialize 1000 times = 1.6171E7 microseconds
Time per complete update: 16171.0 microseconds

So in 16 milliseconds we can (partially) deserialize a vector with a million items, update an entry in that vector, and then reserialize the vector.

Here's the lazy map benchmark:

Time to build a map of size 1000000 = 2.0032E7 microseconds
Time per entry: 20.032 microseconds

Time to deserialize/update/reserialize 1000 times = 3.0838E7 microseconds
Time per complete update: 30838.0 microseconds

And that brings this issue to a close.