Hademlia

A Haskell Kademlia implementation.

What is Kademlia?

Lovely you should ask.

Kademlia is a distributed hash table – why is this cool? It allows a bunch of computers to get together and store information across the entire system. Then, each can find/retrieve information that exists only on a given computer, without ever having talked to it before!

So let’s say Computer 1 wants File A. File A exists on Computer 1337. Computer 1 can make a few jumps across the network (log_2(n) to be exact) to find exactly where File A is - then C1 can talk to C1337 - and splendid piracy continues.

How does it actually work?

A node U keeps track of the following information:

Its ID, which let’s say is in binary format, like 101101.
A Binary Tree containing information about a subset of other nodes in the network.
- The tree has lists as its leaves
- These lists are filled with <Node ID, IP Address, UDP Port> tuples.

A node can perform the following operations:

PING
STORE
FIND_NODE
FIND_VALUE

Most of these operations involve a lookup, which is described further down.

Objects are treated the same as nodes, identity-wise

A given object is hashed into the same format as the node IDs. (Let’s assume binary format)
When performing a FIND_VALUE, the goal is to look for nodes close (see below) to that value (in terms of their IDs)

Nodes have a notion of distance defined by XOR

If we consider Node / Object IDs as binary digits, we can XOR them to find the “distance”
This has a number of cool properies
- unidirectionality : all paths lead to Rome (the node). For a distance d and point x, there is one y such that dist(x,y) = d
- symmetry : dist(x,y) = dist(y,x)

The Lookup

The following is pseudo-code for a node U looking up the ID W.

user sets A, a concurrency parameter
user sets K, a replication parameter
maintain k-heap, a min-heap of nodes ordered by distance from W
                 each entry also maintains a queried? flag (which means queried & received response)
define get-k-closest, which takes a node and target and returns node's K closest nodes to target

define query(to-query):
  results := to-query.each { |x| get-k-closest(x, W) }
  terminate if any of the results contain our value or have our desired node ID
  add results to k-heap
  terminate if the first K of k-heap has been queried
  distances := map (distance to W) on to results
  if min(distances) < peek(k-heap):
    query-next := grab A unqueried from k-heap
  else:
    query-next := grab any node in the first K of k-heap that's unqueried
  query(query-next)

start by calling query on A closest nodes to W.

In order to get the initial A closest nodes:

Assume IDs are 160-bits.
Assume the lookup-tree is a binary tree where the branches at each point are labeled 0 and 1.
Walk the digits of W while walking the tree until you hit a leaf.
Grab the <= k nodes in that bucket.
If < k, walk back a digit and find the next leaf bucket (repeat if necessary).

If the request is for a value, terminate once a node returns the value associated with the key.

Other Implementation Notes

When a node receives any message from another node W, it updates the k-bucket for W.
- If W is already there, move it to the end of the list.
- If it’s not there, add it to the end of the list.
- If it’s not there and the bucket is full, ping the last recently seen node (first) in the k-bucket.
  - If it responds, move it to the end of the list and discard W.
  - If it does not, evict it and add W to the end of the list.
- This way, new nodes can’t flood routing tables.
Once a lookup succeeds, the requesting node stores the <key,value> it was looking for at the closest node that did not return the value.
<key,value>s expire in a time exponentially proportionalo to the number of nodes between the storing node and the next closest node to the key.
If a node hasn’t performed a lookup in a given bucket in an hour, it will pick one at random and perform a lookup.
To join the network, node U adds W into a k-bucket and then performs a node-lookup for itself. Then it refreshes all k-buckets further than closest neighbor.
<key,value>s are republished once every hour
- TODO: there’s more to talk about here

Kooshaba / hademlia