Incremental saves

mweidner037 opened this issue a year ago

In some usage patterns (e.g. P2P), users share states back and forth to bring each other up-to-date. Currently, each user needs to send their whole saved state, even if they only need to share a few updates. I believe this is okay for now because saved states are usually small (tens of KB), but it would be nice to allow "incremental" saves that optimize this.

Specifically, CRuntime.save could take a vector clock or an existing saved state as input, then return a saved state that includes only the delta on top of it. Collabs would receive this input in some (generic) metadata field passed down the tree of Collab.save calls.

When I tried this before, I gave up because finding the delta for deletions is hard: given only the base vector clock, you don't know which elements have been deleted since then, so you need to either send the IDs of all present elements or store tombstones. I now think that the former is okay: although { all IDs } does not asymptotically improve over { all elements }, it should be a large constant-factor improvement. We can also avoid this problem when we're given a whole saved state as the base instead of just a vector clock.
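To make the "send the IDs of all present elements" idea concrete, here is a minimal sketch for a set of elements. All names (State, Delta, getDelta, applyDelta) are hypothetical and are not the Collabs API. It assumes each replica numbers its additions 1, 2, 3, ... and that a vector clock covers a contiguous prefix of each replica's additions. The delta also carries the sender's vector clock so the receiver can tell "deleted by the sender" apart from "never seen by the sender".

```typescript
// Hypothetical types for illustration only; not the Collabs API.
type VectorClock = Record<string, number>; // replicaID -> highest counter seen

interface Element {
  replicaID: string;
  counter: number;
  value: string;
}

interface State {
  elements: Element[];
  clock: VectorClock;
}

interface Delta {
  added: Element[]; // full elements that the base has not seen
  presentIDs: string[]; // IDs (not values) of ALL elements the sender still has
  senderClock: VectorClock;
}

function idOf(e: Element): string {
  return e.replicaID + ":" + e.counter;
}

function seen(clock: VectorClock, e: Element): boolean {
  return e.counter <= (clock[e.replicaID] ?? 0);
}

function getDelta(state: State, base: VectorClock): Delta {
  return {
    added: state.elements.filter((e) => !seen(base, e)),
    presentIDs: state.elements.map(idOf),
    senderClock: { ...state.clock },
  };
}

function applyDelta(local: State, delta: Delta): State {
  const present: Record<string, boolean> = {};
  for (const id of delta.presentIDs) present[id] = true;
  // Keep a local element if the sender still has it, or never saw it.
  // Otherwise the sender saw it and no longer has it, so it was deleted.
  const elements = local.elements.filter(
    (e) => present[idOf(e)] || !seen(delta.senderClock, e)
  );
  // Add elements that are new to us. If we had already seen one,
  // we either still hold it or deleted it ourselves.
  for (const e of delta.added) {
    if (!seen(local.clock, e)) elements.push(e);
  }
  // Merge the clocks entrywise.
  const clock: VectorClock = { ...local.clock };
  for (const r of Object.keys(delta.senderClock)) {
    clock[r] = Math.max(clock[r] ?? 0, delta.senderClock[r]);
  }
  return { elements, clock };
}

// Alice added a, b, c and then deleted b. Bob still has a, b and his own x.
const alice: State = {
  elements: [
    { replicaID: "alice", counter: 1, value: "a" },
    { replicaID: "alice", counter: 3, value: "c" },
  ],
  clock: { alice: 3 },
};
const bob: State = {
  elements: [
    { replicaID: "alice", counter: 1, value: "a" },
    { replicaID: "alice", counter: 2, value: "b" },
    { replicaID: "bob", counter: 1, value: "x" },
  ],
  clock: { alice: 2, bob: 1 },
};
const delta = getDelta(alice, bob.clock);
const merged = applyDelta(bob, delta);
```

In this example the delta carries only one full element (c) plus two short IDs, and Bob drops b without Alice storing a tombstone for it.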

Other CRDT libraries' analogs: Yjs's encodedTargetStateVector argument to encodeStateAsUpdate; Automerge saveIncremental (?).

mweidner037 · Answer 1 · Sun Jul 16 2023 06:25:27 GMT+0800 (China Standard Time)

Incremental saves on top of a base state (not just VC) are also useful for git-style collaboration: store each commit as an incremental save on top of the previous state. (This could even use literal git: before each git commit, store your latest incremental save as a line in a text file.)
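A rough sketch of the "one line per commit" idea, assuming each incremental save is a Uint8Array and the text file is handled as a plain string. The helper names (toHexLine, fromHexLine, appendCommit, readCommits) are made up for illustration, and hex is just one possible text encoding; base64 would be more compact.

```typescript
// Hypothetical helpers; the incremental saves themselves would come from the library.

// Encode one incremental save as a single line of lowercase hex.
function toHexLine(save: Uint8Array): string {
  let line = "";
  for (let i = 0; i < save.length; i++) {
    line += (save[i] < 16 ? "0" : "") + save[i].toString(16);
  }
  return line;
}

// Decode a line produced by toHexLine.
function fromHexLine(line: string): Uint8Array {
  const bytes = new Uint8Array(line.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(line.slice(2 * i, 2 * i + 2), 16);
  }
  return bytes;
}

// Append the latest incremental save as a new line of the tracked file.
function appendCommit(fileText: string, save: Uint8Array): string {
  return fileText + toHexLine(save) + "\n";
}

// Read back every incremental save, oldest first.
// Note: an empty save encodes to an empty line and is skipped here.
function readCommits(fileText: string): Uint8Array[] {
  return fileText
    .split("\n")
    .filter((line) => line.length > 0)
    .map(fromHexLine);
}

let fileText = "";
fileText = appendCommit(fileText, new Uint8Array([1, 255]));
fileText = appendCommit(fileText, new Uint8Array([16]));
const commits = readCommits(fileText);
```

Because each commit only appends a line, git's own line-based diff and merge would treat concurrent commits as independent additions, and loading a branch would mean applying the lines in order.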

mweidner037 · Answer 2 · Wed Jul 19 2023 03:22:32 GMT+0800 (China Standard Time)

Or, load could return a savedState giving the delta over what was just loaded.

mweidner037 · Answer 3 · Sat Jul 29 2023 23:01:50 GMT+0800 (China Standard Time)

I am thinking of calling these "deltas" instead of incremental saves, and divorcing them from CRuntime.load / CRuntime.save. Instead, they would get their own methods CRuntime.getDelta(baseState: Uint8Array) / CRuntime.applyDelta(delta: Uint8Array). Likewise for Collab methods.

Then Collabs would have three kinds of updates, each representing a collection of operations:

Message: single operation; send/receive; op-based CRDTs.
Saved state: all operations up to a point; save/load; state-based CRDTs.
Delta: an arbitrary (contiguous) bundle of operations; getDelta/applyDelta; delta state-based CRDTs.

I am in favor of deltas from the perspective that Collabs is a library for managing collaborative operations. Thus we should let users work with arbitrary bundles of operations - not force them to manually manage messages or to use overlarge saved states when they know better.

The disadvantage is that deltas add a third update interface to each Collab (getDelta/applyDelta), which makes implementing a primitive Collab ~50% harder. However, it is easy to opt out of: just set getDelta = save and applyDelta = load to recover the existing (in)efficiencies. You don't have to do anything new for a composed Collab.
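As a sketch of the opt-out, here is a toy primitive that implements all three update kinds, with getDelta and applyDelta simply delegating to save and load. The PrimitiveLike interface and MaxRegister class are hypothetical, written against the method names proposed above rather than the actual Collab base classes, and the sketch assumes load merges into the current state instead of overwriting it.

```typescript
// Hypothetical interface using the method names proposed in this thread.
interface PrimitiveLike {
  receive(message: Uint8Array): void; // message: a single operation
  save(): Uint8Array; // saved state: all operations so far
  load(savedState: Uint8Array): void;
  getDelta(baseState: Uint8Array): Uint8Array; // delta: a bundle of operations
  applyDelta(delta: Uint8Array): void;
}

// Toy CRDT: a register holding the largest byte value (0-255) it has seen.
class MaxRegister implements PrimitiveLike {
  value = 0;

  receive(message: Uint8Array): void {
    this.value = Math.max(this.value, message[0]);
  }

  save(): Uint8Array {
    return new Uint8Array([this.value]);
  }

  // Assumed to merge with the current state rather than replace it.
  load(savedState: Uint8Array): void {
    this.value = Math.max(this.value, savedState[0]);
  }

  // Opt-out: ignore the base and return the whole saved state.
  getDelta(_baseState: Uint8Array): Uint8Array {
    return this.save();
  }

  // Opt-out: a "delta" is just a saved state, so load it.
  applyDelta(delta: Uint8Array): void {
    this.load(delta);
  }
}

const regA = new MaxRegister();
regA.receive(new Uint8Array([5]));
const regB = new MaxRegister();
regB.receive(new Uint8Array([9]));
regB.applyDelta(regA.getDelta(regB.save()));
regA.applyDelta(regB.getDelta(regA.save()));
```

This is correct but no smaller than a full save, which is the trade-off described above: the primitive keeps its existing costs and gains nothing from the delta interface.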

We could mitigate perceived complexity by combining the three interfaces into a single getUpdate/applyUpdate interface, like Yjs. However, this doesn't actually make implementing a Collab easier - the interfaces are different enough that you'll almost always use different formats and processing algorithms for each update type.

I think it is also better to keep them separate on CRuntime: to work with updates intelligently, the user needs to know which type a given Uint8Array is anyway, so they should already know which method to call (receive/load/applyDelta). (Imagine if they were actually different types instead of all Uint8Array - I think 3 separate methods would be cleaner than a 3x overloaded method.)
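To illustrate the thought experiment, suppose the three update kinds really were distinct types. The Update union, RuntimeLike interface, and applyUpdate function below are hypothetical, not part of Collabs. A combined method would then amount to a switch that forwards to the three separate methods.

```typescript
// Hypothetical tagged update types; in the actual proposal all three are plain Uint8Arrays.
type Update =
  | { kind: "message"; bytes: Uint8Array }
  | { kind: "savedState"; bytes: Uint8Array }
  | { kind: "delta"; bytes: Uint8Array };

// Hypothetical stand-in for the three separate runtime methods.
interface RuntimeLike {
  receive(message: Uint8Array): void;
  load(savedState: Uint8Array): void;
  applyDelta(delta: Uint8Array): void;
}

// A single overloaded entry point can only dispatch on the kind.
function applyUpdate(runtime: RuntimeLike, update: Update): void {
  switch (update.kind) {
    case "message":
      runtime.receive(update.bytes);
      break;
    case "savedState":
      runtime.load(update.bytes);
      break;
    case "delta":
      runtime.applyDelta(update.bytes);
      break;
  }
}

// A runtime that only records which method was called.
const calls: string[] = [];
const recorder: RuntimeLike = {
  receive: () => {
    calls.push("receive");
  },
  load: () => {
    calls.push("load");
  },
  applyDelta: () => {
    calls.push("applyDelta");
  },
};
applyUpdate(recorder, { kind: "delta", bytes: new Uint8Array([1]) });
applyUpdate(recorder, { kind: "message", bytes: new Uint8Array([2]) });
```

The caller already has to know the kind in order to build the tagged value, so calling the matching method directly loses nothing.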

mweidner037 · Answer 4 · Tue Aug 15 2023 06:45:06 GMT+0800 (China Standard Time)

One workaround: use message-based sync instead of state-based sync.

Martin Blom · Answer 5 · Fri Aug 25 2023 19:16:29 GMT+0800 (China Standard Time)

Either way, I think some kind of help from the library would be nice, kind of like how it works in Automerge, where one just keeps calling two functions until the peers are synced up.
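A toy sketch of what such a "call two functions until synced" helper could look like, built on vector clocks. The Peer class and its nextSyncMessage / receiveSyncMessage methods are hypothetical names chosen for this example; they are neither the Automerge API nor the Collabs API. The sketch assumes reliable, in-order delivery between exactly two peers.

```typescript
// Hypothetical sync helper for two peers; not a real library API.
type VectorClock = Record<string, number>;

interface Op {
  replicaID: string;
  counter: number;
  payload: string;
}

interface SyncMessage {
  clock: VectorClock; // everything the sender has
  ops: Op[]; // ops the sender believes the receiver lacks
}

class Peer {
  readonly replicaID: string;
  private log: Op[] = [];
  private clock: VectorClock = {};
  private theirClock: VectorClock | null = null; // what we believe the other peer has
  private announced = false;

  constructor(replicaID: string) {
    this.replicaID = replicaID;
  }

  localOp(payload: string): void {
    const counter = (this.clock[this.replicaID] ?? 0) + 1;
    this.clock[this.replicaID] = counter;
    this.log.push({ replicaID: this.replicaID, counter, payload });
  }

  payloads(): string[] {
    return this.log.map((op) => op.payload);
  }

  // Returns the next message to send, or null if there is nothing to say.
  nextSyncMessage(): SyncMessage | null {
    if (this.theirClock === null) {
      // We know nothing about the other peer yet: announce our clock once.
      if (this.announced) return null;
      this.announced = true;
      return { clock: { ...this.clock }, ops: [] };
    }
    const theirs = this.theirClock;
    const ops = this.log.filter((op) => op.counter > (theirs[op.replicaID] ?? 0));
    if (ops.length === 0) return null;
    // Optimistically assume delivery so we do not resend these ops.
    for (const op of ops) {
      theirs[op.replicaID] = Math.max(theirs[op.replicaID] ?? 0, op.counter);
    }
    return { clock: { ...this.clock }, ops };
  }

  receiveSyncMessage(msg: SyncMessage): void {
    for (const op of msg.ops) {
      // Apply only the next expected op from each replica; skip duplicates.
      if (op.counter === (this.clock[op.replicaID] ?? 0) + 1) {
        this.clock[op.replicaID] = op.counter;
        this.log.push(op);
      }
    }
    const theirs = this.theirClock ?? {};
    for (const r of Object.keys(msg.clock)) {
      theirs[r] = Math.max(theirs[r] ?? 0, msg.clock[r]);
    }
    this.theirClock = theirs;
  }
}

// Keep exchanging messages until neither peer has anything left to send.
// Returns the number of rounds in which at least one message was sent.
function syncUntilQuiet(a: Peer, b: Peer): number {
  let rounds = 0;
  while (true) {
    const toB = a.nextSyncMessage();
    if (toB !== null) b.receiveSyncMessage(toB);
    const toA = b.nextSyncMessage();
    if (toA !== null) a.receiveSyncMessage(toA);
    if (toB === null && toA === null) return rounds;
    rounds++;
  }
}

const peerA = new Peer("a");
peerA.localOp("a1");
peerA.localOp("a2");
const peerB = new Peer("b");
peerB.localOp("b1");
const roundsUsed = syncUntilQuiet(peerA, peerB);
```

The application never inspects clocks or decides between messages, saved states, and deltas; it only relays whatever the two functions produce until both return null.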