layout: sharding the blob store

cyphar opened this issue 8 years ago

One issue that I'm quite worried about is the performance impact of having too many blobs inside an OCI image. Now, practically speaking I would be surprised if n > 20 in most cases, but some people have expressed that they would like to have the entire universe bottled into an OCI image. I will refrain from commenting on how good of an idea I think that is, but if it's going to be a "valid use case" then we should reconsider how we've organised the blob directory.

Namely, the current method of blobs/<algo>/<digest> will cause problems if the number of digests becomes quite large, due to implementation issues of filesystems. Essentially all filesystems are not designed to handle accesses of directories with many files well. If you look at how git, camlistore and many other such projects implement their blob storage it looks more like blobs/<algo>/<prefix>/<suffix> (or in camlistore's case, three sets of <prefix>/).
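To make the two layouts concrete, here is a minimal sketch of how a digest would map to a path under each scheme. The function names are hypothetical, and the two-hex-character prefix is an assumption borrowed from git's object store; nothing in this issue fixes a prefix length.

```python
import hashlib
from pathlib import PurePosixPath


def flat_blob_path(algo: str, hex_digest: str) -> PurePosixPath:
    """Current layout: blobs/<algo>/<digest>."""
    return PurePosixPath("blobs") / algo / hex_digest


def sharded_blob_path(algo: str, hex_digest: str, prefix_len: int = 2) -> PurePosixPath:
    """Proposed layout: blobs/<algo>/<prefix>/<suffix>.

    prefix_len is an assumption (2 hex characters, as git uses).
    """
    if not 0 < prefix_len < len(hex_digest):
        raise ValueError("prefix_len must leave a non-empty prefix and suffix")
    prefix, suffix = hex_digest[:prefix_len], hex_digest[prefix_len:]
    return PurePosixPath("blobs") / algo / prefix / suffix


# Example: the digest of an empty blob under both layouts.
digest = hashlib.sha256(b"").hexdigest()
print(flat_blob_path("sha256", digest))
print(sharded_blob_path("sha256", digest))
```

With a two-character hex prefix there are at most 256 subdirectories per algorithm, so each one holds roughly 1/256 of the blobs.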

Naturally this would be a backwards incompatible change (you can't really implement this scheme as well as retaining the old one because then you have an exponential number of ways to read the same blob data, almost certainly leading to countless implementation bugs). So we should probably consider this for post-1.0.0.

W. Trevor King · Answer 1 · Sun Nov 06 2016 12:17:58 GMT+0800 (China Standard Time)

On Sat, Nov 05, 2016 at 04:50:36PM -0700, Aleksa Sarai wrote:

> … but some people have expressed that they would like to have the
> entire universe bottled into an OCI image…

That may be me ;). I'd rather phrase this as “I'd like the whole
universe in one flat CAS namespace, with individual CAS engines biting
off as large a chunk of that universe as they like”. What I've tried
to supply in opencontainers/runtime-tools#5 is an API that works
regardless of the number of blobs in CAS.

Whether a particular implementation of that API (e.g. image-layout)
scales to huge blob counts (clearly the tar-backed image-layout does
not) is a less important question. Folks will just use a different
ref/CAS engine when they have large stores. But ref/CAS consumers
shouldn't have to worry about that sort of implementation detail.

> Namely, the current method of blobs/<algo>/<digest> will cause
> problems if the number of digests becomes quite large, due to
> implementation issues of filesystems. Essentially all filesystems
> are not designed to handle accesses of directories with many files
> well.

This has come up before in #94 and #208, with the bulk of the
discussion based on 1. The consensus (as I understood it) was that
we shouldn't worry about this for now because modern filesystems don't
mind and tar isn't going to care either way. Having stable, scalable
APIs buffers downstream consumers from any future CAS-storage
optimizations.

Brandon Philips · Answer 2 · Thu Nov 17 2016 05:22:22 GMT+0800 (China Standard Time)

Agreed this is a dupe of #208.

Aleksa Sarai · Answer 3 · Thu Nov 17 2016 13:30:12 GMT+0800 (China Standard Time)

@philips It's not a dupe of #208. #208 was about blobs/sha256/<the full digest> rather than blobs/sha256/<three byte>/<rest of digest> (which is what this is about). But I don't have strong opinions because I don't agree with @wking's wish to stuff everything into a single CAS.

Nell Boulle · Answer 4 · Mon Nov 21 2016 19:49:53 GMT+0800 (China Standard Time)

@cyphar I guess in particular #208 (comment) challenges the premise of this issue

Akihiro Suda · Answer 5 · Mon Feb 13 2017 15:50:50 GMT+0800 (China Standard Time)

This does not seem to be a dupe of #208.

Even though a pull should never need to call readdir(), a push may call it, depending on the distribution protocol and its implementation, and that is likely to result in poor performance.

Also, there can be third-party tools (e.g. malware scanners, backup tools) that are not aware of the OCI manifest and hence end up calling readdir().
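The distinction between the two access patterns can be sketched as follows. This is only an illustration with hypothetical helper names: a lookup by digest resolves one known path, while enumeration has to iterate over every entry in blobs/<algo> (os.scandir, which is backed by readdir() on POSIX systems).

```python
import os


def blob_exists(root: str, algo: str, hex_digest: str) -> bool:
    """Lookup by digest: resolves a single known path, no directory listing."""
    return os.path.isfile(os.path.join(root, "blobs", algo, hex_digest))


def list_blobs(root: str, algo: str) -> list:
    """Enumeration: visits every entry in blobs/<algo>.

    This is the access pattern of a tool that does not know the digests
    in advance, so its cost grows with the number of blobs.
    """
    with os.scandir(os.path.join(root, "blobs", algo)) as entries:
        return sorted(entry.name for entry in entries if entry.is_file())
```

A consumer that follows the manifest only ever needs the first function; a scanner or backup tool behaves like the second.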

Can we reconsider this issue?

Akihiro Suda · Answer 6 · Mon Feb 13 2017 16:30:09 GMT+0800 (China Standard Time)

Since the layout of blobs/<algo> can no longer be changed, we might need to come up with some alternative layout.

Some of my ideas, with their pros and cons:

1. blobs-sharded/<algo>/<prefix>/<digest>
   Pro: Does not contaminate the existing blobs directory.
   Con: Having two blob directories (blobs and blobs-sharded) may be confusing.
2. blobs/<algo>-sharded/<prefix>/<digest>
   Pro: Single blobs directory.
   Con: sha256-sharded looks like an algorithm name, which can cause implementation issues.
3. blobs/<algo>/<prefix>/<digest> (identical to the original proposal)
   Pro: Single blobs directory, no algorithm namespace contamination.
   Con: Scanning the contents of blobs/sha256 can be 2X slower, because the directory is likely to also contain traditional blobs for compatibility.

My preference is 1.
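A rough sketch of how a reader might resolve a blob under option 1 while staying compatible with the existing directory. The helper names, the two-character prefix, and the "sharded first, then flat" lookup order are all assumptions made for illustration; none of them is specified here.

```python
import os


def candidate_paths(root: str, algo: str, hex_digest: str, prefix_len: int = 2) -> list:
    """Return the paths to try, in order (the ordering is an assumption)."""
    sharded = os.path.join(
        root, "blobs-sharded", algo, hex_digest[:prefix_len], hex_digest
    )
    flat = os.path.join(root, "blobs", algo, hex_digest)
    return [sharded, flat]


def find_blob(root: str, algo: str, hex_digest: str):
    """Return the first existing path for the blob, or None if it is absent."""
    for path in candidate_paths(root, algo, hex_digest):
        if os.path.isfile(path):
            return path
    return None
```

Because both candidates are computed from the digest, neither lookup needs to list a directory.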

Also, we would need to define a new field for the list of supported blob layouts in the oci-layout file (or maybe index.json).

e.g.

{
    "imageLayoutVersion": "42.0.0",
    "supportedBlobLayouts": [ // if empty, "v1compat" is implicitly selected
        "v1compat",
        "sharded"
        // there can be other layouts that are specific to the distribution protocol? (e.g. "ipfs")
    ]
}
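As a sketch of how a consumer might interpret such a field: the code below treats a missing or empty supportedBlobLayouts as "v1compat", as suggested in the comment above. The field itself is only a proposal, and the function names and the reader's preference order are assumptions. An actual oci-layout file would have to be strict JSON, so the // comments from the example could not appear in it.

```python
import json

# Layouts this hypothetical reader understands, in its own preference order.
KNOWN_LAYOUTS = ("sharded", "v1compat")


def supported_blob_layouts(oci_layout_text: str) -> list:
    """Return the advertised layouts, defaulting to ["v1compat"]."""
    doc = json.loads(oci_layout_text)
    # A missing or empty list implicitly selects "v1compat".
    return doc.get("supportedBlobLayouts") or ["v1compat"]


def pick_layout(oci_layout_text: str, known=KNOWN_LAYOUTS) -> str:
    """Pick the first layout the reader knows that the image advertises."""
    advertised = supported_blob_layouts(oci_layout_text)
    for name in known:
        if name in advertised:
            return name
    raise ValueError("no supported blob layout in %r" % (advertised,))
```

For example, an existing file such as {"imageLayoutVersion": "1.0.0"} carries no such field and would therefore resolve to "v1compat".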