google / trillian

A transparent, highly scalable and cryptographically verifiable data store.


Support ExtraData de-duplication in MySQL backend

rolandshoemaker opened this issue

(This is a rather complicated proposition, so it's more of a speculative request than a concrete proposal.)

The CT personality builds leaves for inclusion whose ExtraData field contains the user-submitted chain, minus the end-entity certificate. This data has relatively low cardinality and takes up a considerable amount of space in the storage backend.

It would be great if there were an option to store the ExtraData in a separate table (or something) from LeafData so that it could be deduplicated (keyed by a hash of the data, say), with each LeafData row referencing the relevant ExtraData row in a many-to-one setup. In certain setups this could save >50% of current storage requirements. This optimization seems only really relevant to the CT usage of Trillian, so I'm not entirely sure whether it would cause issues for other personalities. It would also likely make the MySQL queries slightly more expensive.
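For illustration, here is a minimal sketch of what that split might look like. The table, column, and function names are hypothetical, not Trillian's actual schema:

```go
package dedup

import (
	"crypto/sha256"
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // assumed MySQL driver
)

// Illustrative DDL only; not Trillian's real schema. LeafData rows would
// carry ExtraDataHash instead of the blob itself, giving a many-to-one
// reference from leaves to shared chains.
const dedupSchema = `
CREATE TABLE IF NOT EXISTS ExtraDataDedup (
  ExtraDataHash VARBINARY(32) PRIMARY KEY, -- SHA-256 of the blob
  ExtraData     MEDIUMBLOB NOT NULL
);`

// storeExtraData inserts the blob if it is new and returns the hash that
// a LeafData row would reference instead of carrying the blob itself.
func storeExtraData(db *sql.DB, extraData []byte) ([]byte, error) {
	h := sha256.Sum256(extraData)
	// INSERT IGNORE makes repeated chains cheap: duplicates become no-ops.
	_, err := db.Exec(
		`INSERT IGNORE INTO ExtraDataDedup (ExtraDataHash, ExtraData) VALUES (?, ?)`,
		h[:], extraData,
	)
	return h[:], err
}
```

Keying the table by the content hash makes the insert idempotent, so writers never need to check for an existing row first.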

It's an interesting request, though, as you say, possibly more of a CT thing than a generic one.

We have considered options like not storing leaf data in the database at all. Since it's immutable, it could be served from edge caches or similar. That was more about performance than saving disk space, though: for example, you could pack up ranges of leaves and serve them much faster than MySQL can.
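As a rough sketch of that idea (all names here are hypothetical), contiguous ranges of immutable leaves could be packed into blobs keyed by their index range and pushed to a cache:

```go
package leafcache

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// packRange serialises a contiguous run of immutable leaves, starting at
// index start, into a single blob with a range-based cache key. Each leaf
// is length-prefixed so the blob can be split apart again at the edge.
func packRange(start int64, leaves [][]byte) (key string, blob []byte) {
	var buf bytes.Buffer
	for _, l := range leaves {
		binary.Write(&buf, binary.BigEndian, uint32(len(l))) // writes to a Buffer cannot fail
		buf.Write(l)
	}
	end := start + int64(len(leaves)) - 1
	return fmt.Sprintf("leaves/%d-%d", start, end), buf.Bytes()
}
```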

Not sure we'd want to make this sort of schema change now, but it's possible that experiments along these lines could be done with a modified CT personality that stores some sort of cache ID instead of the leaf data. That might be a place to start.
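A sketch of what that personality-side experiment might look like (the ChainCache interface and both function names are made up for illustration):

```go
package ctexperiment

import (
	"crypto/sha256"

	"github.com/google/trillian"
)

// ChainCache is an assumed content-addressed store for deduplicated
// chains (a side table, memcache, edge cache, ...), owned by the
// personality rather than by Trillian.
type ChainCache interface {
	Put(key, chain []byte) error
	Get(key []byte) ([]byte, error)
}

// dedupeLeaf swaps the full chain for its hash before submitting the
// leaf to Trillian, which then stores only 32 bytes of ExtraData.
func dedupeLeaf(c ChainCache, leaf *trillian.LogLeaf) error {
	key := sha256.Sum256(leaf.ExtraData)
	if err := c.Put(key[:], leaf.ExtraData); err != nil {
		return err
	}
	leaf.ExtraData = key[:]
	return nil
}

// rehydrateLeaf restores the chain when serving get-entries.
func rehydrateLeaf(c ChainCache, leaf *trillian.LogLeaf) error {
	chain, err := c.Get(leaf.ExtraData)
	if err != nil {
		return err
	}
	leaf.ExtraData = chain
	return nil
}
```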

From a purely Trillian point of view, ExtraData should probably not be in there at all, as it has absolutely nothing to do with the Merkle tree.

Having the data stored together makes it possible (even though we don't take advantage of it at the moment) to serve get-entries (which makes up about 75% of read requests on our CT logs; the rest is mostly get-sth, with trace amounts of get-sth-consistency) with much lower latency, by doing a single sequential read instead of one sequential read plus N lookups (where N is the number of entries fetched). With deduplication there would probably be fewer than N lookups, though.
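To make the two read shapes concrete, here are hypothetical queries for both layouts (table and column names are illustrative only):

```go
package queries

// Co-located layout: one sequential scan serves a get-entries range.
const colocatedRead = `
SELECT LeafValue, ExtraData
FROM   LeafData
WHERE  SequenceNumber BETWEEN ? AND ?
ORDER BY SequenceNumber`

// Deduplicated layout: the scan picks up an extra lookup per distinct
// chain via the join, though far fewer than N of them after dedup.
const dedupedRead = `
SELECT l.LeafValue, e.ExtraData
FROM   LeafData l
JOIN   ExtraDataDedup e ON e.ExtraDataHash = l.ExtraDataHash
WHERE  l.SequenceNumber BETWEEN ? AND ?
ORDER BY l.SequenceNumber`
```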

Note that if Trillian didn't use revisions for subtrees of logs (which I believe logs don't need; only maps do?), you might be looking at both a speedup of proof retrieval and a very sizable reduction in storage requirements (I don't have numbers handy, but if someone told me about 50%, I'd believe it). It would also probably speed up sequencing.

This discussion has migrated to google/certificate-transparency-go#691, as it was CT-specific. Generalisations are possible, but we tend to think that they should happen on the personality side.

I propose to close this issue here.

A good follow-up request to this one would be optionally interleaving the ExtraData field into the sequenced leaf data. That would help make get-entries calls faster for these kinds of use cases.
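In other words (a hypothetical layout, just to make the idea concrete), the sequenced record would carry both blobs inline:

```go
package interleave

// SequencedRecord sketches the interleaved layout: ExtraData sits next
// to the leaf in sequence order, so a get-entries range scan is a single
// sequential read with no per-entry lookups.
type SequencedRecord struct {
	LeafIndex int64
	LeafValue []byte
	ExtraData []byte // interleaved only for personalities that opt in
}
```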