automerge / automerge-repo

On Document URLs, and Document IDs

pvh opened this issue · comments

It looks like I won't be able to finish this patch before I go out on holiday, so I'm going to write up my plan and push the branch so that it can either be critiqued or finished (or both).

I'm working on a new format to replace the UUID document IDs used by automerge-repo.

The goal of this change is twofold. First, to introduce a recognizable and consistently parsable URL for Automerge documents that can be stored in an Automerge document. In the future, this URL should support specifying heads, or perhaps branch IDs, or other kinds of tomfoolery, but for now it's just designed to be a recognizable URL. Second, to allow Automerge-Repo to immediately discard URLs that are either the wrong data type or the result of a transcription error.

The URL format is straightforward:
automerge:<checksummed-bs58-encoded-UUID>
It looks like this:
automerge:3f1w4KRPqEgwCrGGdyM6ATLYfMWo
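As a rough sketch, a URL of this shape can be recognized with a simple pattern check. The grammar below is my assumption, not a finalized spec — base58 uses the Bitcoin alphabet, which omits 0, O, I, and l:

```typescript
// Hypothetical sketch: recognize the proposed URL shape with a pattern check.
// The base58 (Bitcoin) alphabet omits the confusable characters 0, O, I, and l.
const AUTOMERGE_URL = /^automerge:([1-9A-HJ-NP-Za-km-z]+)$/

function parseAutomergeUrl(url: string): string | null {
  const match = url.match(AUTOMERGE_URL)
  return match ? match[1] : null
}

console.log(parseAutomergeUrl("automerge:3f1w4KRPqEgwCrGGdyM6ATLYfMWo"))
// → "3f1w4KRPqEgwCrGGdyM6ATLYfMWo"
console.log(parseAutomergeUrl("https://example.com")) // → null
```

Note that this only checks the shape; catching transcription errors is the job of the checksum discussed below.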

Let's discuss each part of the URL.

scheme / protocol

First, the scheme, automerge. I've chosen to use a custom scheme because Automerge is not run over HTTP (though it can tunnel over it via websockets), and because a traditional HTTP URL would require us to provide elements like a hostname, which doesn't really exist in this context.

Unfortunately, the automerge scheme can't be fetch()'d, at least not as of this writing, and I presume by extension it can't be intercepted by a service worker either. I ran some experiments looking for a scheme that would be accepted by fetch() and so on, and concluded that browser URL parsing inconsistencies and limitations on fetch meant there was no way to produce an authentic automerge URL that would work reliably without doing things like adding a made-up hostname or calling it HTTPS.

uuid

I've kept the UUID as the "deep" representation of the data to maintain some amount of consistency with past versions, and also because using ~128 bits of entropy to identify a unique resource is pretty much industry standard. I could have chosen a shorter identifier (maybe 64 bits is enough) but there's no reason to be cute here.

bs58 / bs58check

The encoding serves two purposes. First, it's shorter: a 16 byte UUID in its usual hyphenated hex form is 36 characters of text, while a bs58 encoding is roughly a third shorter, at about 22 characters. We then "spend" a few of those characters back to append a four byte checksum (bringing the encoded form to 27-28 characters), which allows us to detect if the URL was copy-pasted with a character missing or (worse) if someone is just passing wildly unrelated values into the system.
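To make the arithmetic concrete, here is a minimal bs58check encoder, assuming the standard Bitcoin-style construction (append the first four bytes of SHA-256(SHA-256(payload)), then base58-encode). The function names are illustrative, not automerge-repo's actual API:

```typescript
import { createHash } from "node:crypto"

const ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

function base58Encode(bytes: Uint8Array): string {
  // Treat the bytes as one big integer and repeatedly divide by 58.
  let n = bytes.reduce((acc, b) => acc * 256n + BigInt(b), 0n)
  let out = ""
  while (n > 0n) {
    out = ALPHABET[Number(n % 58n)] + out
    n /= 58n
  }
  // Each leading zero byte is encoded as a leading '1'.
  for (const b of bytes) {
    if (b !== 0) break
    out = "1" + out
  }
  return out
}

function sha256(data: Uint8Array): Uint8Array {
  return new Uint8Array(createHash("sha256").update(data).digest())
}

function bs58checkEncode(payload: Uint8Array): string {
  // Checksum = first 4 bytes of the double SHA-256 of the payload.
  const checksum = sha256(sha256(payload)).slice(0, 4)
  const withChecksum = new Uint8Array(payload.length + 4)
  withChecksum.set(payload)
  withChecksum.set(checksum, payload.length)
  return base58Encode(withChecksum)
}

// A 16-byte UUID plus the 4-byte checksum (20 bytes total) encodes to 27-28 chars.
const documentId = new Uint8Array(16).fill(0xab) // hypothetical document ID
console.log(`automerge:${bs58checkEncode(documentId)}`)
```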

library internals

I've concluded the URL format should only exist "at the edge" of the library. Internally, on disk, and over the network we should use the most efficient 16 byte binary representation.
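The edge conversion in the other direction would decode a candidate string back to the internal 16 byte representation, rejecting anything whose length or checksum doesn't hold. Again a sketch with illustrative names, not the actual API:

```typescript
import { createHash } from "node:crypto"

const ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

function base58Decode(s: string): Uint8Array | null {
  let n = 0n
  for (const c of s) {
    const i = ALPHABET.indexOf(c)
    if (i === -1) return null // not a base58 character: reject immediately
    n = n * 58n + BigInt(i)
  }
  const bytes: number[] = []
  while (n > 0n) {
    bytes.unshift(Number(n % 256n))
    n /= 256n
  }
  // Leading '1' characters decode to leading zero bytes.
  for (const c of s) {
    if (c !== "1") break
    bytes.unshift(0)
  }
  return new Uint8Array(bytes)
}

function sha256(data: Uint8Array): Uint8Array {
  return new Uint8Array(createHash("sha256").update(data).digest())
}

// Returns the 16 raw bytes, or null on wrong length / failed checksum.
function decodeDocumentId(encoded: string): Uint8Array | null {
  const bytes = base58Decode(encoded)
  if (bytes === null || bytes.length !== 20) return null
  const payload = bytes.slice(0, 16)
  const expected = sha256(sha256(payload)).slice(0, 4)
  for (let i = 0; i < 4; i++) {
    if (bytes[16 + i] !== expected[i]) return null
  }
  return payload
}

// The example ID from above with its last character dropped in transit:
// the length or checksum no longer holds, so it is rejected.
console.log(decodeDocumentId("3f1w4KRPqEgwCrGGdyM6ATLYfMW"))
```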

This URL format should be stored in the document as a string (for now), but at some point in the future we may add a custom type for it to optimize storage and retrieval of document connections.

Within the library, logging and anything else user facing should use the textual bs58check representation of the UUID, so that users can make visual comparisons against the URL they passed in.

Conclusions

URLs are hard! But this seems like a reasonable plan. The most likely critiques I anticipate are with the bs58check system, and my theory there is that it's better to use something off the shelf than to invent something new.

I welcome your questions / comments.

I'm supportive of this plan. 128 bit length is good. I think it's a good idea to add a new datatype to Automerge that stores 16 binary bytes internally, and converts it to/from a bs58check URI in the API. We will need to make the internal encoding extensible, so that new features such as branch IDs/heads can be added to future URIs without a breaking change. For spidering purposes we could also offer an API to parse all of the Automerge URIs out of a compressed document without loading the whole thing into memory — it should be possible to do this very efficiently by scanning only the value metadata and raw value columns.

Agreed -- one question I continue to noodle on is how the document ID should be represented outside of the storage format. I think my current position (which I could be talked out of) is that a document ID is always shown as a bs58check-encoded UUID. If we present the document IDs as something closer to their source form (hex strings, for example) then it will be confusing / annoying for users who are trying to correlate URLs and the underlying IDs.

On the other hand, the spec for bs58check is to append the first four bytes of SHA-256(SHA-256(payload)) as the checksum, which seems (without benchmarking) annoyingly expensive. We could define a different format but... now we're inventing things.

lol why hash it twice? That seems rather strange. On the upside, we already include an implementation of SHA256 so at least it shouldn't have much impact on the Wasm module size. Might be worth doing a very simple benchmark to check whether it would have a noticeable impact on loading times. I suspect it's probably fast enough. Agree about always showing the bs58-check encoded URI in the external API.

I think the double-hashing is to prevent certain attacks when bs58check is used for cryptographic identity. CRC would be better for perf (presuming we don't need security), but as mentioned we'd have to implement it ourselves.

commented

This approach makes sense to me. The only additional question I have is whether we use the same format for binary blobs stored in an automerge repo. We probably don't want users to store large binary blobs in Automerge documents, but I also don't think we want to make users roll their own storage and sync for binary blobs, because that way you lose a lot of interoperability (you end up with things like git LFS). This has implications for the internal representation of the document ID type (you probably want to be able to scan for all the reachable blobs, as well as documents, from a given document). But I think the important question for this discussion is whether an automerge URL should be able to point at a binary blob stored in an automerge repo, and if so, whether the URL should indicate that it is a blob, or whether that's something determined when you actually try to load or request the blob.

commented

After some sync discussion with @pvh and @ept I think we don't need to do anything special for handling binary blob URLs. We also decided that the URL shouldn't specify the length of the underlying bytes.