deislabs / bindle

Bindle: Object Storage for Collections

Bindles mix naming and ontology

npmccallum opened this issue

A parcel is defined by its ontology, namely the hash of its contents. If a parcel changes, its hash changes. This is great precisely because immutability is enforced throughout the entirety of the chain. At any point you can validate that the parcel is unmodified.
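As a minimal sketch of the property described above, assuming SHA-256 as the hash function (the function names `parcel_id` and `verify_parcel` are hypothetical and not part of Bindle's API), content addressing might look like this:

```python
import hashlib


def parcel_id(data: bytes) -> str:
    """Return the hex SHA-256 digest, used here as the parcel's identity (assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_parcel(data: bytes, expected_id: str) -> bool:
    """True only if the bytes still hash to the recorded identity."""
    return parcel_id(data) == expected_id


original = b"hello parcel"
recorded = parcel_id(original)

# Unchanged bytes verify; any edit, however small, produces a different hash.
unchanged_ok = verify_parcel(original, recorded)
modified_ok = verify_parcel(b"hello parcel!", recorded)
print(unchanged_ok, modified_ok)
```

Because the identity is derived from the bytes, the check can be repeated at any point in the chain without trusting whoever handed over the data.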

An invoice is, in reality, just a kind of parcel where the server knows how to introspect its contents. And yet, an invoice is not defined by its ontology but by a name. Therefore, it is not possible to track the immutability of the invoice throughout the system.

Naming and ontology are two different layers and bindle's current approach mixes these layers. It appears that the decision to mix these layers was based on the desire to be able to gain isomorphism between invoice encodings (TOML, JSON, CBOR, etc). But it isn't obvious to me why this is a design goal and why we have to give up the most important property of a content store (verifiable immutability of all contents) in order to achieve it.

IMHO, bindle should operate more like the other successful content stores (git, S3, OCI, Docker Hub, etc.) where all objects are immutable and referred to by ontology. Naming is a layer above it and you can "tag" a name (which includes version) to a particular object.

An invoice was not modeled as a parcel at all. It was modeled as a set of structured metadata with well-defined fields. A parcel is to be thought of as "free form data" about which the system knows very little. In contrast, an invoice is a set of known fields arranged in a particular way.

Because these are different things, they can be reasoned about differently. A parcel is just a blob of data, and any change to that blob of data should rightly raise our eyebrows. But an invoice is about the semantics of the object, not the syntax. We want to be able to reason about what the invoice means, and detect any change to what the invoice means. We don't particularly care about a change to the syntax (e.g. whether whitespace has been compressed, whether it has been formatted in JSON or YAML or TOML or XML, etc). All we care about is that the semantic content of the invoice is unmodified.

Ideally, what we want, then, is a way to establish semantic immutability without caring about syntax -- we want to verify the meaning without verifying that the presentation is the same as it was before. This is valuable for several reasons, but the easiest one is that we can write documents in a human-readable format, but then have the system adapt those documents to whatever the technical requirements of the consuming agents are.

While I am not thrilled with the current state of things, it does achieve this to a limited extent. That is, by recomposing fields in a trivial format, one can regenerate the Merkle tree of the parcels.

I'm not opposed to having a canonical representation of the invoice that we could transform an invoice into and then hash to generate. E.g. an ordered CBOR document would be fine for something like that. I don't actually feel too strongly about this particular feature of Bindle. It was done largely on pragmatic grounds, and to get away from the ridiculousness of having spurious "mutations" simply because (say) Go's serializer formats things slightly differently than (say) Java's.
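The idea of hashing a canonical representation can be sketched as follows. This is an illustration only: it uses canonical JSON (sorted keys, no insignificant whitespace) from the Python standard library as a stand-in for the ordered CBOR mentioned above, and the field names `name` and `version` as well as the function `semantic_hash` are assumptions made for the example.

```python
import hashlib
import json


def semantic_hash(invoice: dict) -> str:
    """Hash a canonical re-serialization rather than the bytes as written."""
    canonical = json.dumps(
        invoice, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Two syntactically different documents with the same meaning.
pretty = '{\n  "name": "foo",\n  "version": "1.2.3"\n}'
compact = '{"version":"1.2.3","name":"foo"}'

raw_pretty = hashlib.sha256(pretty.encode("utf-8")).hexdigest()
raw_compact = hashlib.sha256(compact.encode("utf-8")).hexdigest()
sem_pretty = semantic_hash(json.loads(pretty))
sem_compact = semantic_hash(json.loads(compact))

print(raw_pretty == raw_compact)  # byte-level hashes
print(sem_pretty == sem_compact)  # canonical-form hashes
```

The byte-level hashes differ because whitespace and key order differ, while the canonical-form hashes agree. The trade-off is that every implementation must produce exactly the same canonical bytes.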

If we were to change to hashing the serialized object, we would need to make a few changes: We probably need to switch signatures to be detached objects, rather than being presented on the invoice. That would essentially allow signing to occur without mutating the invoice in any way. Yanking could likely be done the same way. /cc @thomastaylor312
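A rough sketch of what "detached" means here, under loud assumptions: HMAC-SHA256 from the standard library stands in for a real public-key signature scheme, and `sign_detached`, `verify_detached`, and the sample invoice bytes are hypothetical. The point is only that the signature lives beside the invoice, so the invoice bytes and their hash never change when a signature is added.

```python
import hashlib
import hmac


def sign_detached(key: bytes, invoice_bytes: bytes) -> str:
    """Produce a signature stored separately from the invoice.

    HMAC is a stand-in (assumption); a real system would use an
    asymmetric signature scheme.
    """
    digest = hashlib.sha256(invoice_bytes).digest()
    return hmac.new(key, digest, hashlib.sha256).hexdigest()


def verify_detached(key: bytes, invoice_bytes: bytes, signature: str) -> bool:
    """Check a detached signature against the unmodified invoice bytes."""
    return hmac.compare_digest(sign_detached(key, invoice_bytes), signature)


invoice_bytes = b'name = "foo"\nversion = "1.2.3"\n'
key = b"demo-key-not-for-production"

hash_before = hashlib.sha256(invoice_bytes).hexdigest()
signature = sign_detached(key, invoice_bytes)  # kept alongside, not inside
hash_after = hashlib.sha256(invoice_bytes).hexdigest()

print(hash_before == hash_after)
print(verify_detached(key, invoice_bytes, signature))
```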

We're basically at a crossroads in the Enarx project, trying to decide whether to invest a person in Bindle or to build our own. I'd really prefer the former since Bindle is very close to what we want and I think a lot more people can benefit. Here's what I propose:

  1. Discard the invoice conversions between serializations. I suspect this is an anti-feature. Just pick a serialization.

I don't buy the argument that the system can adapt the documents to the technical requirements of the consuming agents. This is because the consuming agents need to understand the Bindle protocol anyway. The particular serialization of the invoice is a far smaller requirement than the Bindle protocol as a whole.

The other problem is that, while the goal of establishing semantic immutability is a noble one, there is no industry-accepted method for doing this. The only thing we have is byte-for-byte immutability. And this is particularly true if you want a signature on the invoice. The moment you add a signature to an invoice it becomes byte-for-byte immutable.

  2. Make invoices immutable. You upload them in TOML (Or CBOR? I'm not picky.) and they are never mutated. They are measured as raw bytes in their serialized form. We don't have to worry about differences in serializers because the serialization is immutable and they will all deserialize the same.

  3. An invoice and its parcels represent a Merkle tree. The invoice represents the top-level hash. Each modification of the invoice creates a completely new Merkle tree (though you can still deduplicate on parcels).

  4. Names and versions are a layer above. Naming (and versioning) represents a human-provided value on top of the mathematical model (the Merkle tree). For example, you can have the assertion that foo-1.2.3 = deadbeef.... This assertion can be signed. So, for example: signature(key, foo-1.2.3 = deadbeef...). Yanking disassociates the name/version from the Merkle tree. But the contents are never removed.

  5. Invoices themselves can be parcels. This might solve the namespace overloading problem we see in issues #266, #269 and #270. For example, you can have a top-level bindle that is foo-1.2.6. It contains bindles for build variants (#269). The build variant bindles, in turn, can contain architecture-specific parcels.
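The Merkle-tree part of this proposal can be sketched as below. Everything here is an assumption made for illustration: the in-memory `store`, the helper names `put` and `make_invoice`, and the toy invoice format (one child hash per line), which is not Bindle's invoice format. The sketch shows nested invoices, parcel deduplication, and how a change to one leaf yields a new top-level hash.

```python
import hashlib

store = {}  # content hash -> bytes; identical content is stored once


def put(data: bytes) -> str:
    """Store bytes under their SHA-256 hash and return that hash."""
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h


def make_invoice(child_hashes) -> bytes:
    """Toy invoice (hypothetical format): sorted child hashes, one per line."""
    return "\n".join(sorted(child_hashes)).encode("ascii")


# Leaves: one shared parcel plus one architecture-specific parcel each.
lib = put(b"shared library bytes")
x86 = put(b"x86_64 binary")
arm = put(b"aarch64 binary")

# Build-variant invoices, then a top-level invoice that contains them.
inv_x86 = put(make_invoice([lib, x86]))
inv_arm = put(make_invoice([lib, arm]))
top = put(make_invoice([inv_x86, inv_arm]))

# Patching one leaf creates a new tree; untouched nodes are reused.
x86_v2 = put(b"x86_64 binary, patched")
inv_x86_v2 = put(make_invoice([lib, x86_v2]))
top_v2 = put(make_invoice([inv_x86_v2, inv_arm]))

print(top != top_v2)
print(len(store))
```

In this layout a name such as foo-1.2.6 would simply be a pointer to `top` or `top_v2`, kept outside the tree.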

If we can agree on an approach, I can dedicate someone to work on this before the end of the month.

I may not be quite following the argument here but how does this work if a server stores invoice data as rows in a relational database (e.g. for ease of lookup in larger systems) rather than as a blob?

@itowlson That is an internal implementation detail. The API would need to reproduce the invoice byte-for-byte as it was uploaded. But you can store the semantic meaning any way you want internally for things like queries and such.
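One way to reconcile the two, sketched under assumptions: keep the exact uploaded bytes for hashing and download, and derive a separate queryable index from them. The class `InvoiceStore`, its methods, and the `parcels` field are hypothetical, and JSON is used only because it is in the Python standard library; a dict stands in for the relational tables.

```python
import hashlib
import json


class InvoiceStore:
    """Keeps raw invoice bytes plus a derived index for queries (sketch)."""

    def __init__(self):
        self._raw = {}        # invoice hash -> exact uploaded bytes
        self._by_parcel = {}  # parcel hash -> set of invoice hashes

    def upload(self, raw: bytes) -> str:
        invoice_hash = hashlib.sha256(raw).hexdigest()
        self._raw[invoice_hash] = raw
        # The parsed form feeds the index only; it is never re-serialized.
        for parcel in json.loads(raw)["parcels"]:
            self._by_parcel.setdefault(parcel, set()).add(invoice_hash)
        return invoice_hash

    def download(self, invoice_hash: str) -> bytes:
        """Return the invoice byte-for-byte as uploaded."""
        return self._raw[invoice_hash]

    def invoices_containing(self, parcel: str) -> set:
        """Example query answered from the index, not from the blobs."""
        return set(self._by_parcel.get(parcel, set()))


db = InvoiceStore()
raw = b'{ "parcels": ["aaa",   "bbb"] }\n'  # odd spacing is preserved
h = db.upload(raw)
print(db.download(h) == raw)
print(db.invoices_containing("aaa") == {h})
```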

Let me try to rephrase the problem another way.

Let's say that a Bindle server gets compromised. During the compromised period, workloads were deployed from Bindle. The Bindle server and the workload runner belong to two different parties.

The owner of the Bindle server publishes that a compromise occurred during a certain window of time. The workload owners now need to do forensic reconstruction of the logs for all the systems which deployed a workload from Bindle during this period to find out what was compromised.

You see in the log that workload foo-1.2.3 was deployed. But what are the actual contents of that package? How can you know?

On the other hand, if your log contains foo-1.2.3 (HASH), you can validate that the hash of the whole package hasn't changed.
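The forensic check described above might look like the following sketch. The log line format, the `audit` function, and the dict standing in for the Bindle server are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical log format: "deployed <name-version> (<sha256 hex>)"
LOG_LINE = re.compile(r"^deployed (?P<name>\S+) \((?P<hash>[0-9a-f]{64})\)$")


def audit(log_lines, fetch_invoice):
    """Return names whose current invoice bytes no longer match the logged hash."""
    suspect = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not a deployment record
        current = hashlib.sha256(fetch_invoice(match["name"])).hexdigest()
        if current != match["hash"]:
            suspect.append(match["name"])
    return suspect


# A dict stands in for the server: name-version -> invoice bytes.
server = {"foo-1.2.3": b"invoice A", "bar-0.1.0": b"invoice B"}

# Log written at deployment time, recording the hash next to the name.
log = [
    f"deployed {name} ({hashlib.sha256(data).hexdigest()})"
    for name, data in server.items()
]

# Later, the content behind one name is swapped on the server.
server["foo-1.2.3"] = b"invoice A (tampered)"

suspect = audit(log, server.__getitem__)
print(suspect)
```

With only the name in the log, the swap would be invisible; with the hash recorded, the mismatch is detectable by the workload owner alone.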

In order to accomplish this, the invoice needs to be divided into two parts:

  1. The first part contains only the collection of parcels and the metadata about them. It has no name and no version. This half of the invoice is represented by the cryptographic hash of its canonical serialization (which is byte-for-byte unmodified since its upload). Because the invoice contains cryptographic hashes of all parcels, the (unnamed) invoice becomes the head of a Merkle tree.

  2. The second part contains the "tag" of a particular (unnamed) bindle. Stated another way, this is an assertion that a bindle with a particular hash is foo-1.2.3. This assertion can be signed. Yanking a bindle is simply disassociating the name/version from the bindle hash.

A few important properties arise from this:

  1. Workloads can pin to a particular bindle. They get immutability without signatures.
  2. Forensic reconstruction can identify, without signatures, the exact contents of a particular workload.
  3. The relation between a name/version and its bindle hash can be public data even if the contents of the bindle are secret.
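The two-part split and the pinning property can be sketched as follows. The names `objects`, `tags`, `put`, `tag`, `yank`, and `fetch_pinned` are hypothetical, and yanking is modeled here as described in this comment (removing the name-to-hash association while keeping the content).

```python
import hashlib

objects = {}  # part 1: bindle hash -> unnamed invoice bytes (never removed)
tags = {}     # part 2: "name-version" -> bindle hash


def put(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    objects[h] = data
    return h


def tag(name_version: str, bindle_hash: str) -> None:
    """Assert that the bindle with this hash is name_version."""
    tags[name_version] = bindle_hash


def yank(name_version: str) -> None:
    """Disassociate the name; the content stays in the store."""
    tags.pop(name_version, None)


def fetch_pinned(bindle_hash: str) -> bytes:
    """Fetch by hash and verify; no signature is needed for integrity."""
    data = objects[bindle_hash]
    if hashlib.sha256(data).hexdigest() != bindle_hash:
        raise ValueError("content does not match pinned hash")
    return data


invoice = b"parcel hashes and metadata, no name, no version"
tag("foo-1.2.3", put(invoice))

pin = tags["foo-1.2.3"]  # a workload resolves the name once, then pins
yank("foo-1.2.3")

print("foo-1.2.3" in tags)
print(fetch_pinned(pin) == invoice)
```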

This is roughly how Docker Hub and OCI container registries work. These represent real advancement in the state of the art, and it would be a shame to lose these properties while trying to build something better.

A couple of comments:

First off, I think having a canonical serialization and a hash of that data is a good idea. I actually think splitting the invoice into two parts is possibly a good solution here (pending me reasoning through it some more). Basically, it would make the signing part of invoices much simpler (rather than needing to reconstruct the entire parcel list). However, with that said, there are a couple of things that should still be requirements:

  1. Signing and verification should still be required. If we've learned anything from doing this with containers and our work in Helm, it is that if we don't make signing a default/required, then it will never be used. The signature part is an important feature of Bindle IMO.
  2. Yanking should not be a mutable operation. That means that it won't disassociate the name from the invoice hash, but just mark the release as unavailable (similar to what happens with Rust crates). Basically, we were trying to avoid the nightmare of OCI tags and how they could change willy-nilly.
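The second requirement can be sketched as a registry in which a name is bound to a hash exactly once and yanking only sets a flag. The class `TagRegistry` and its methods are hypothetical; the sketch assumes that a yanked release stays auditable by name while no longer resolving for new consumers.

```python
class TagRegistry:
    """Write-once name -> hash bindings with a yanked marker (sketch)."""

    def __init__(self):
        self._tags = {}  # "name-version" -> {"hash": str, "yanked": bool}

    def publish(self, name_version: str, bindle_hash: str) -> None:
        # Unlike a movable tag, an existing name can never be rebound.
        if name_version in self._tags:
            raise ValueError(f"{name_version} is already published")
        self._tags[name_version] = {"hash": bindle_hash, "yanked": False}

    def yank(self, name_version: str) -> None:
        """Mark the release unavailable; the binding itself is untouched."""
        self._tags[name_version]["yanked"] = True

    def resolve(self, name_version: str) -> str:
        """Resolution for new consumers; refuses yanked releases."""
        entry = self._tags[name_version]
        if entry["yanked"]:
            raise LookupError(f"{name_version} has been yanked")
        return entry["hash"]

    def hash_of(self, name_version: str) -> str:
        """Audit lookup; still answers for yanked releases."""
        return self._tags[name_version]["hash"]


registry = TagRegistry()
registry.publish("foo-1.2.3", "deadbeef")
registry.yank("foo-1.2.3")
print(registry.hash_of("foo-1.2.3"))
```

The design choice is that the name-to-hash relation only ever grows, so a log entry naming foo-1.2.3 keeps a single possible meaning.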

I am not a fan of having an invoice be a parcel as well because, as @technosophos stated, they are fundamentally different things (one expressing relationships and one being arbitrary data). Although with the idea of splitting the invoice into 2 parts, this becomes more of an implementation detail.

@npmccallum Thanks for the clarification. Your use case clarifies the goal, and seems like a useful thing to have. I agree that storage format should be an implementation detail, but was struggling to understand how to reconcile that with (what I understood as) the proposal to treat invoices as parcels.