supr/Loki

Loki, a storage system.

Quick start:

mvn clean compile scala:compile dependency:copy-dependencies
mkdir loki0 loki1 loki2 loki3

# In four separate terminals:
scala -classpath `ls target/dependency/*.jar | tr '\n' ':'`:target/classes com.memeo.loki.Main 0 2 0
scala -classpath `ls target/dependency/*.jar | tr '\n' ':'`:target/classes com.memeo.loki.Main 1 2 0
scala -classpath `ls target/dependency/*.jar | tr '\n' ':'`:target/classes com.memeo.loki.Main 2 2 0
scala -classpath `ls target/dependency/*.jar | tr '\n' ':'`:target/classes com.memeo.loki.Main 3 2 0

curl -X PUT http://localhost:8080/dbname
curl -X PUT -d '{"foo":"bar"}' http://localhost:8080/dbname/docname

You can also use port 8081, 8082, or 8083.

The three command line parameters are my-id, i, and n. i and n are parameters for the linear hashing algorithm used to balance storage throughout the cluster. For a given value of i, there can be 2ⁱ nodes in that cluster.

You can also run a one-node cluster by running it one time, giving parameters 0 0 0; or, a two-node cluster by running it twice, giving parameters 0 1 0 and 1 1 0.

How it works

The primary class is LokiService, which is an Akka Actor that receives pseudo-HTTP requests (the request contents are the same, the methods and response codes are the same, and there can be headers, but it isn't formatted as HTTP). A Grizzly HTTP listener handles HTTP requests, decodes the HTTP, and forwards it to the actor. Storage is split by database, so if a request to node A arrives for a database stored on node B, node A forwards the request to the remote actor on node B, gets the reply from node B, and responds to the HTTP request.

The cluster can be thought of as a linear hash table, where the key is the database name. When a node handles a request for database "foo", it will hash that name with MD5, then use that hash to compute a linear hash code, which will map to the number of the node the request should be forwarded to.

When a document is written, it is first written to the primary database, then (as long as there are at least three nodes in the cluster) the write is replicated to node n-1 and node n+1.

TODO

Zookeeper-based cluster configuration.
Map/reduce views (probably any JVM language: JavaScript, Groovy, etc).
Request rerouting to replicas when the primary is offline.
Restoration of a primary from replicas.
Rebalancing of the cluster when it grows.

Implementation Ideas/Notes

Disk Storage

Trying out a few things; we want it to support concurrent access, sorted primary key iteration, sub-maps. Should be stable, fast, and highly durable.

MapDB. A newer Java-only database; provides b-tree-backed implementation of a concurrent, navigable map. Looks good, still very new.
BerkeleyDB.
Reimplement CouchDB style storage (append-only b-trees) in Scala.

View Server

We need a way to execute map and reduce functions (written in a JVM language, JavaScript, Groovy first; then Python, Ruby, maybe Scala and Java). It needs enough isolation so it

Wrap the view execution in a sandbox.
Run the view execution in a separate JVM process (with a security manager), communicate to it via remote actors, stdin/stdout, etc.
Reimplement CouchDB's view server.

Storage Model

Linear hashing produces unbalanced write loads if the node count isn't of the form 2ⁱ. Cluster expansion could simply mean doubling the size each time, which sounds bad, except you then consider that the time between expansions would also double (given the same write load).

Should explore other partitioning schemes.

Serving, storing nodes.

We can easily add more HTTP server nodes to the cluster, just give them a larger node value then the maximum node value in the cluster. You can also have more storage nodes than server nodes, just by walling off some of the nodes' HTTP servers. We could also provide an option to disable the HTTP server completely.

supr / Loki