staltz / ppppp-sync

Fill dataless msgs with data if they're in wantRange

staltz opened this issue

Problem

Suppose you're replicating with a peer that has a smaller haveRange than yours, e.g. they have the goal newest-20 while you have the goal newest-100. When you run the sync protocol, they will send you new msgs in your wantRange, but many of them may be trails, because that peer applies a stricter GC policy.

So you accept those trail msgs anyway and store them on disk.

However, if you now replicate with another peer that happens to have a more generous haveRange (e.g. from the goal newest-500), they will be able to provide the dataful msgs that you want. But you will signal to them that you already have those msgs on disk (which you do, except they're stored as dataless trails), and thus you won't receive the dataful msgs, even though you really want them.

This means that the strictest remote peer will, in practice, dictate how many dataful msgs you can replicate. This is undesirable protocol behavior.
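To make the failure concrete, here is a minimal TypeScript sketch (hypothetical names and shapes, not the actual ppppp-sync API) of how a presence check that ignores msg.data keeps the dataful versions from ever being requested:

```ts
type Msg = { id: string; depth: number; data: object | null };

// Naive check: "have" means the record exists at all, even as a trail.
function naiveHave(log: Map<string, Msg>, id: string): boolean {
  return log.has(id);
}

// What we'd want: only count a msg as "had" if it is dataful.
function datafulHave(log: Map<string, Msg>, id: string): boolean {
  const msg = log.get(id);
  return msg !== undefined && msg.data !== null;
}

const log = new Map<string, Msg>([
  // a trail accepted earlier from the stricter peer
  ['m1', { id: 'm1', depth: 1, data: null }],
]);

console.log(naiveHave(log, 'm1'));   // true  => generous peer never sends data
console.log(datafulHave(log, 'm1')); // false => we would request the dataful msg
```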

There is a difficulty here that goes beyond the sync protocol: msgs in the log are assumed to be either appended or shrunk (when erased); no other modification or addition is allowed. In fact, the log implementation checks that a msg can only be overwritten if the new size is $\leq$ the old size. So in the log we can't "replace" a dataless msg with its dataful equivalent.
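A hedged sketch of that size constraint (assumed shape, not the actual ppppp-db code):

```ts
// An overwrite is only accepted when the new serialized record fits
// into the old record's slot, so erasing data works but restoring it
// does not. Hypothetical helper, for illustration only.
function canOverwrite(oldRecord: Uint8Array, newRecord: Uint8Array): boolean {
  return newRecord.byteLength <= oldRecord.byteLength;
}

const dataful = new TextEncoder().encode('{"data":{"text":"hello"}}');
const trail = new TextEncoder().encode('{"data":null}');

console.log(canOverwrite(dataful, trail)); // true: erasing shrinks the record
console.log(canOverwrite(trail, dataful)); // false: can't grow back in place
```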

Glossary

  • Dataless: when msg.data is null
  • Dataful: when msg.data is an actual content object
  • Trail: a msg that originally had msg.data, but whose msg.data was erased; now it only serves to validate the sigchain back to the root
  • Erase: database log operation that deletes only msg.data from a msg record
  • Goal: for a given tangle, this informs how much we are interested in replicating/preserving this tangle
  • wantRange: a tuple [minDepth, maxDepth] of depths of msgs that we signal to a remote peer that we want to receive, for a given tangle
  • haveRange: a tuple [minDepth, maxDepth] covering all the depths of msgs that we have in the database log (both ranges are sketched as types right after this list)
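For illustration, here is one way the terms above could be expressed as types (assumed shapes, not the actual ppppp-db definitions):

```ts
type Msg = {
  id: string;
  depth: number;        // position of this msg in its tangle
  data: object | null;  // null means dataless (e.g. an erased trail)
};

// [minDepth, maxDepth], used for both wantRange and haveRange
type Range = [minDepth: number, maxDepth: number];

// e.g. { type: 'newest', count: 100 } for the goal "newest-100"
type Goal = { type: 'all' } | { type: 'newest'; count: number };
```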

Thought

If we can toposort all the records in the log, then this issue may be gone, because we can append low-depth msgs after high-depth msgs, and they'll be sorted so that low-depth comes first. This would allow us to effectively "replace" a dataless msg with its dataful equivalent by (1) deleting the dataless msg, (2) appending the dataful msg.

The downside of this approach is that we have to sort all the records from the log when the log is loaded into memory. That's an $O(n \log n)$ operation, which is pretty bad considering that currently loading the log into memory is $O(n)$, and we don't want to go much higher than that. Also, as records are appended live, we would have to sort them into place too.
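A sketch of this approach (hypothetical names; the sort key assumes msg depth gives a valid topological order for the tangle):

```ts
type Msg = { id: string; depth: number; data: object | null };

// "Replace" a dataless msg: (1) delete it, (2) append the dataful
// equivalent. On-disk order is now wrong, but the load-time sort fixes it.
function replaceWithDataful(records: Msg[], dataful: Msg): Msg[] {
  return records.filter((m) => m.id !== dataful.id).concat(dataful);
}

// Load-time sort: O(n log n), versus the current O(n) linear scan.
function loadSorted(records: Msg[]): Msg[] {
  return [...records].sort((a, b) => a.depth - b.depth);
}
```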

Thought 2

Instead of toposorting all the records, we can also "rebuild" the entire tangle in question by (1) deleting all of the tangle's msgs from the log while temporarily keeping them in memory, (2) appending the new low-depth msgs synced from the remote peer, (3) re-appending the high-depth msgs that were kept in memory.

This would create some holes in the log, but that's fine; we can compact the log afterwards.

This approach might be the most realistic so far.
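A sketch of this rebuild, against a hypothetical log API (the real ppppp-db interface will differ):

```ts
type Msg = { id: string; depth: number; data: object | null };

interface Log {
  getTangleMsgs(tangleId: string): Msg[]; // all msgs of one tangle
  delete(msgId: string): void;            // leaves a hole in the log
  append(msg: Msg): void;
  compact(): void;                        // reclaims the holes
}

function rebuildTangle(log: Log, tangleId: string, synced: Msg[]): void {
  // (1) delete the tangle's msgs from the log, keeping them in memory
  const kept = log.getTangleMsgs(tangleId);
  for (const msg of kept) log.delete(msg.id);

  // drop dataless copies superseded by a dataful synced version
  const syncedIds = new Set(synced.map((m) => m.id));
  const survivors = kept.filter((m) => !syncedIds.has(m.id));

  // (2) append the new low-depth msgs from the remote peer first
  for (const msg of [...synced].sort((a, b) => a.depth - b.depth)) {
    log.append(msg);
  }

  // (3) re-append the high-depth msgs that were kept in memory
  for (const msg of survivors) log.append(msg);

  // the deletes left holes behind; compaction cleans them up
  log.compact();
}
```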

Idea 3 (from Mix)

If you get fewer msgs than you originally wanted, save them in a temporary log; when you finally have all the msgs you want, transfer them to the main log in the correct order.
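A sketch of this temporary-log idea (hypothetical API; the completeness check is simplified to one msg per depth, whereas real tangles can have several msgs at the same depth):

```ts
type Msg = { id: string; depth: number; data: object | null };

class TempLog {
  private buffer = new Map<string, Msg>();

  add(msg: Msg): void {
    this.buffer.set(msg.id, msg);
  }

  // Simplified: complete when some msg exists at every depth in range.
  isComplete(minDepth: number, maxDepth: number): boolean {
    const depths = new Set([...this.buffer.values()].map((m) => m.depth));
    for (let d = minDepth; d <= maxDepth; d++) {
      if (!depths.has(d)) return false;
    }
    return true;
  }

  // Transfer everything to the main log in depth order, then reset.
  flushInto(append: (msg: Msg) => void): void {
    const sorted = [...this.buffer.values()].sort((a, b) => a.depth - b.depth);
    for (const msg of sorted) append(msg);
    this.buffer.clear();
  }
}
```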

This is fixed (I think) in ppppp-sync and ppppp-db, but requires new tests in ppppp-sync, so I'm busy with that.