[Feature] Add a new flag for regctl digest to decompress

thesayyn opened this issue 2 months ago · comments

Current Behavior

I realized there is a hidden regctl command, regctl digest, which is really useful for calculating hashes of blobs. However, it can't be used for calculating the diffid, which is the digest of the decompressed blob.

Expected Behavior

A new flag for regctl digest, --decompress=gzip,lzma,auto, which decompresses the blob (gzip or xz) before calculating the digest.
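
As a rough illustration of what an auto mode could do, here is a minimal Python sketch that detects gzip or xz from the leading magic bytes and hashes the decompressed bytes. The function names are hypothetical, the flag itself is only a proposal, and none of this reflects how regctl is implemented. For brevity the blob is held in memory.

```python
import gzip
import hashlib
import lzma

# Magic bytes at the start of each supported compressed stream.
GZIP_MAGIC = b"\x1f\x8b"
XZ_MAGIC = b"\xfd7zXZ\x00"


def detect_compression(blob: bytes) -> str:
    """Return "gzip", "xz", or "none" based on the leading magic bytes."""
    if blob.startswith(GZIP_MAGIC):
        return "gzip"
    if blob.startswith(XZ_MAGIC):
        return "xz"
    return "none"


def decompressed_digest(blob: bytes, mode: str = "auto") -> str:
    """Hypothetical stand-in for the proposed --decompress=<mode> behavior."""
    if mode == "auto":
        mode = detect_compression(blob)
    if mode == "gzip":
        blob = gzip.decompress(blob)
    elif mode in ("xz", "lzma"):
        # The default lzma format auto-detects .xz and legacy .lzma streams.
        blob = lzma.decompress(blob)
    elif mode != "none":
        raise ValueError(f"unsupported compression: {mode}")
    return "sha256:" + hashlib.sha256(blob).hexdigest()
```

With this sketch, a gzip blob, an xz blob, and the raw bytes of the same content all produce the same digest, which is the property a diffid relies on.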

Version

(paste output of `regctl version` or similar for binaries, or `docker image inspect regclient/...` on the image and paste the labels)

Sahin Yort · Answer 1 · Sat Apr 27 2024 01:58:40 GMT+0800 (China Standard Time)

I am happy to contribute this if there are no concerns, @sudo-bmitch.

Brandon Mitchell · Answer 2 · Sat Apr 27 2024 03:49:34 GMT+0800 (China Standard Time)

Hidden commands tend to be unsupported, experimental, or temporary. In this case, my plan was to remove it from regctl for a while since I've just used sha256sum instead. And for decompressing, that would end up being a command like gunzip -c $file | sha256sum -. For your use case, would an implementation of #721 remove the need for this?
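
For readers without gunzip and sha256sum at hand, the pipeline above can be approximated in Python's standard library. This is a sketch under the assumption that the layer is a gzip file on disk; the function name is made up for illustration.

```python
import gzip
import hashlib


def gzip_diffid(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the decompressed contents of a gzip file in fixed-size chunks."""
    hasher = hashlib.sha256()
    with gzip.open(path, "rb") as stream:
        # Read the decompressed stream piece by piece so large layers
        # never have to fit in memory.
        while chunk := stream.read(chunk_size):
            hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()
```

The result carries the sha256: prefix used for OCI digests, whereas sha256sum prints only the hex string.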

Sahin Yort · Answer 3 · Sat Apr 27 2024 03:58:55 GMT+0800 (China Standard Time)

I understand, and I agree that alternatives exist like you said, but under Bazel that means we have to fetch more binaries (gunzip, shasum, and zstdcat), which is a lot more cumbersome than running regctl digest.
We have to do things like pre-building pigz for faster build times: https://github.com/thesayyn/pigz-prebuilt/releases/tag/v2.8

I'll admit that this makes my life easier in return for a little bit of code in here. My most compelling argument would be that this feature complements #721 solution 2, where under Bazel the digest for each layer is calculated out-of-band using this feature.

Sahin Yort · Answer 4 · Sat Apr 27 2024 03:59:34 GMT+0800 (China Standard Time)

#728 is already implementing this, maybe you could take a look and see if it's too much to accept.

Sahin Yort · Answer 5 · Sat Apr 27 2024 04:02:36 GMT+0800 (China Standard Time)

As for moving this command, if you decide to keep it, I believe it could be called regctl blob calculate-digest.

Brandon Mitchell · Answer 6 · Sat Apr 27 2024 23:33:33 GMT+0800 (China Standard Time)

> I understand, and I agree that alternatives exist like you said, but under Bazel that means we have to fetch more binaries (gunzip, shasum, and zstdcat), which is a lot more cumbersome than running regctl digest. We have to do things like pre-building pigz for faster build times: https://github.com/thesayyn/pigz-prebuilt/releases/tag/v2.8

Given some of your use cases, I'm surprised you didn't opt to make your own binary. It could still leverage a library like regclient, but give you the ability to add exactly the features you need without including a bunch of out of scope code. That's the direction a bunch of others have gone: https://github.com/regclient/regclient/network/dependents

Sahin Yort · Answer 7 · Sun Apr 28 2024 04:07:25 GMT+0800 (China Standard Time)

Oh, we are really trying hard not to create our own tooling, for various reasons. One of them is that what we are doing in rules_oci isn't specific to Bazel; it is the same as what everyone else is doing, so we must use off-the-shelf tooling.

The principle is: Bazel is a build tool that should run existing tools without a significant behavior difference.

Since you mentioned out of scope, do you think this change is out of scope?
My perception was that diffid calculation is a pretty big part of container images, and the most expensive part of container assembly.

Brandon Mitchell · Answer 8 · Sun Apr 28 2024 05:08:13 GMT+0800 (China Standard Time)

I'd lean towards saying the regctl digest command isn't a core competency of regctl, and so I hesitate to have a dependency on it since it could be removed in a future release. The one place I can think of that computes the diffid is regctl image mod. For the rest of the regctl commands, it uses the digest on the content for validation and addressability, which doesn't need the diffid, only the digest or descriptor of the blob or manifest.

None of this is a hard no, just a hesitation to say yes knowing that it adds a new feature to maintain that hasn't been needed by other users.

I'm still fuzzy on why bazel wants to precompute the digest of a new layer, versus letting the tooling compute it during the regctl image mod command. One advantage of letting regctl compute the digests is that both can be computed simultaneously when reading and compressing the content. Is there a requirement to verify the content wasn't tampered with when being passed to regctl, is there a need to extract the digests for logging, or something else?
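
To make the "both can be computed simultaneously" point concrete, here is a minimal Python sketch of a single read loop that feeds the uncompressed bytes to one hash (the diffid) and the gzip output to another (the blob digest). It is an illustration only, with a hypothetical function name, and is not taken from regctl's code.

```python
import hashlib
import zlib


def compress_and_digest(tar_stream, out_stream, chunk_size=1 << 20):
    """Gzip tar_stream into out_stream and return (diffid, digest).

    The diffid hashes the uncompressed bytes that are read; the digest
    hashes the compressed bytes that are written.
    """
    diffid = hashlib.sha256()
    digest = hashlib.sha256()
    # wbits=31 asks zlib for the gzip container format.
    compressor = zlib.compressobj(wbits=31)

    def emit(data):
        # The compressor may buffer and return nothing for a given chunk.
        if data:
            digest.update(data)
            out_stream.write(data)

    while chunk := tar_stream.read(chunk_size):
        diffid.update(chunk)
        emit(compressor.compress(chunk))
    emit(compressor.flush())
    return "sha256:" + diffid.hexdigest(), "sha256:" + digest.hexdigest()
```

The content is read once, and both values are ready as soon as the compressed blob has been written.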

Sahin Yort · Answer 9 · Sun Apr 28 2024 06:45:44 GMT+0800 (China Standard Time)

> I'm still fuzzy on why bazel wants to precompute the digest of a new layer, versus letting the tooling compute it during the regctl image mod command.

It could be done either way, but calculating it before running regctl image mod has a few advantages over calculating it during regctl image mod.

TLDR:

1- Incrementality: doing the diffid/digest calculation as part of regctl image mod will yield poor results on incremental changes to the layer because it will have to redo all the work for unchanged layers as well.
2- Cacheability: diffid/digest calculation for the layers can be cached and redone only when a layer changes.

Let me explain with an example; forget about Bazel for a moment. Let's say we are creating an image:

It has 8 layers, and each layer's diffid (5s) and digest (5s) calculations take 10s in total
It uses an empty base image; assume this takes 0 seconds for simplicity
It is built on 8 cores

First approach, where we calculate the diffid as part of regctl image mod and have optimized it to run on all 8 cores:

Run regctl image mod --layer=1.tar.gz ... --layer=8.tar.gz -> 10s total
Change the fifth layer
Run regctl image mod --layer=1.tar.gz ... --layer=8.tar.gz -> 10s total again, because all the work has to be done again.

The whole workflow took 20s, 10s for each invocation.

Second approach, where we calculate the diffid out-of-band, computed in parallel on 8 cores:

Run regctl digest --decompress < N.tar.gz for each layer, write to N.diffid -> 5s total
Run regctl digest < N.tar.gz for each layer, write to N.digest -> 5s total
Run regctl image mod --layer=$(cat N.diffid)=$(cat N.digest)=N.tar.gz -> 0s (everything is already calculated)
Change the fifth layer
Run regctl digest --decompress < 5.tar.gz and regctl digest < 5.tar.gz -> 5s (computed diffid and digest in parallel)
Run regctl image mod --layer=$(cat N.diffid)=$(cat N.digest)=N.tar.gz -> 0s (everything is already calculated)

Whole workflow took 15s, 10s for the first invocation, and 5s for the second.

These numbers are made up, just to give you an idea; in reality the numbers are far worse for the first option due to IO pressure from having to redo all the work.

Hope the use case is clearer now.
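
The arithmetic in the example above can be checked with a small Python model. It uses the same made-up figures (8 layers, 8 cores, 5s per diffid or digest calculation) and assumes every hashing task parallelizes perfectly with no IO contention, which is a simplification.

```python
import math

# Made-up numbers from the example above.
LAYERS = 8          # layers in the image
CORES = 8           # hashing tasks that can run at once
HASH_SECONDS = 5    # one diffid or one digest calculation


def hashing_time(layers_to_hash: int) -> int:
    """Wall-clock seconds to compute a diffid and a digest for each layer."""
    tasks = 2 * layers_to_hash  # one diffid task and one digest task per layer
    return math.ceil(tasks / CORES) * HASH_SECONDS


# First approach: every invocation rehashes all layers.
in_band_total = hashing_time(LAYERS) + hashing_time(LAYERS)

# Second approach: the rebuild only rehashes the one changed layer.
out_of_band_total = hashing_time(LAYERS) + hashing_time(1)

print(in_band_total, out_of_band_total)  # 20 15
```

The gap grows with the number of unchanged layers per rebuild, since only the first approach pays for them again.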

Sahin Yort · Answer 10 · Sun Apr 28 2024 06:50:35 GMT+0800 (China Standard Time)

> None of this is a hard no, just a hesitation to say yes knowing that it adds a new feature to maintain that hasn't been needed by other users.

My use of regctl is a little bit more nuanced than a usual user's, to be frank. I was hoping you'd see it being useful for some other use cases; more specifically, I thought it would be useful for people dealing with OCI artifacts.

All that said, no hard feelings if you don't feel like regctl should have something like this. (I understand as a maintainer)

Brandon Mitchell · Answer 11 · Mon Apr 29 2024 21:30:20 GMT+0800 (China Standard Time)

> 1- Incrementality: doing the diffid/digest calculation as part of regctl image mod will yield poor results on incremental changes to the layer because it will have to redo all the work for unchanged layers as well.

I don't understand this. The digest of the other layers has not changed, so you would only need to compute the digest of the layers that have changed. I think there is some logic in Docker to chain metadata from the various steps of the build, but that doesn't exist in OCI images or in buildkit to the best of my knowledge. Each layer can be treated as an independent entity, and only when they are assembled with the overlay filesystem will the effect of one layer on another be seen.

Sahin Yort · Answer 12 · Mon Apr 29 2024 22:48:00 GMT+0800 (China Standard Time)

This has a bit to do with the build system: what happens to the result of regctl image mod? If the result of the previous run is thrown away, then there's no way one can know what has changed unless it looks into the layers and sees whether they already exist.

Bazel's incrementality model is built on actions, and an action is basically inputs + tools = output. Any time any of the inputs change, the command is invoked again, the previous output artifact is thrown away, and the new one is stored in the cache.
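
The "inputs + tools = output" idea can be illustrated with a toy action cache in Python. This is a deliberately simplified model with hypothetical names, not how Bazel is implemented: the cache key is derived from the tool identifier and the content of each input, and an action only runs when that key has not been seen.

```python
import hashlib


class ActionCache:
    """Toy action cache: an output is reused while tool and inputs are unchanged."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0  # how many times an action really ran

    def run(self, tool_id, inputs, action):
        # Key the action on the tool plus the content hash of every input.
        key = hashlib.sha256(tool_id.encode())
        for blob in inputs:
            key.update(hashlib.sha256(blob).digest())
        cache_key = key.hexdigest()
        if cache_key not in self._outputs:
            self.executions += 1
            self._outputs[cache_key] = action(inputs)
        return self._outputs[cache_key]


def layer_digest(inputs):
    """Action with a single input: the digest of one layer blob."""
    return "sha256:" + hashlib.sha256(inputs[0]).hexdigest()


def build(cache, layers):
    """One cached action per layer, so a changed layer reruns only its own action."""
    return [cache.run("digest-tool-v1", [layer], layer_digest) for layer in layers]
```

If all layers were instead inputs to a single action, changing any one of them would change the key and rerun the whole action, which is the incrementality concern raised in this thread.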

Brandon Mitchell · Answer 13 · Tue Apr 30 2024 02:51:12 GMT+0800 (China Standard Time)

> This has a bit to do with the build system: what happens to the result of regctl image mod? If the result of the previous run is thrown away, then there's no way one can know what has changed unless it looks into the layers and sees whether they already exist.

Do you have more details on this process? You can't both throw away an image and mod the image simultaneously. Are you trying to maintain a build cache outside of the repository? If that's the case, I think building the manifest and pushing the blobs directly from your tooling would have a better experience. A regctl image mod command will likely recompute the digests since the use case is for users that don't want to construct the image and manage the DAG themselves.

Sahin Yort · Answer 14 · Tue Apr 30 2024 03:25:07 GMT+0800 (China Standard Time)

> You can't both throw away an image and mod the image simultaneously.

You are right, the previously modded image gets thrown away, and the flow I described above gets executed from scratch. Bazel describes this well here: https://bazel.build/basics/artifact-based-builds.

> Are you trying to maintain a build cache outside of the repository?

Yes, ocidir is effectively a cache, but subsequent builds don't have access to the prior ocidir.

> If that's the case, I think building the manifest and pushing the blobs directly from your tooling would have a better experience.

Under rules_oci, building and pushing happen separately; we store everything in an ocidir, which is fast.

> A regctl image mod command will likely recompute the digests since the use case is for users that don't want to construct the image and manage the DAG themselves.

That's why I proposed two forms of the regctl image mod flag: regctl image mod --layer=layer.tar.gz, which computes the diffid/digest as part of regctl image mod, and regctl image mod --layer=<diffid>,<digest>,layer.tar.gz, which just appends a new descriptor to the layers array and calls it a day.

This flag is more important to me than regctl image mod --layer though.
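
To show what "appends a new descriptor to the layers array" involves at the data level, here is a minimal Python sketch operating on an OCI image manifest and image config held as dictionaries. The function name is hypothetical and this is not regctl's implementation. It trusts the supplied diffid, digest, and size without verifying them, and it assumes a gzip-compressed layer.

```python
import copy
import hashlib
import json

# OCI media type for a gzip-compressed tar layer.
LAYER_GZIP = "application/vnd.oci.image.layer.v1.tar+gzip"


def append_layer(manifest, config, diffid, digest, size):
    """Return (new_manifest, new_config_bytes) with one precomputed layer added."""
    new_manifest = copy.deepcopy(manifest)
    new_config = copy.deepcopy(config)

    # The manifest references the compressed blob by digest and size.
    new_manifest["layers"].append(
        {"mediaType": LAYER_GZIP, "digest": digest, "size": size}
    )
    # The config lists the digest of the uncompressed layer (the diffid).
    new_config["rootfs"]["diff_ids"].append(diffid)

    # The config changed, so its descriptor in the manifest must be refreshed
    # against the exact bytes that will be stored.
    config_bytes = json.dumps(new_config, separators=(",", ":")).encode()
    new_manifest["config"]["digest"] = (
        "sha256:" + hashlib.sha256(config_bytes).hexdigest()
    )
    new_manifest["config"]["size"] = len(config_bytes)
    return new_manifest, config_bytes
```

Even in this shortcut form the config blob is rewritten and rehashed, but that blob is small compared with the layers themselves.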

Sahin Yort · Answer 15 · Tue Apr 30 2024 03:25:28 GMT+0800 (China Standard Time)

Hope it's clearer now what I am trying to do.

Sahin Yort · Answer 16 · Tue Apr 30 2024 10:21:44 GMT+0800 (China Standard Time)

Sorry for being a little pushy, but do you think #728 can land? If not, I have to find another way to do it.

Brandon Mitchell · Answer 17 · Wed May 01 2024 03:22:12 GMT+0800 (China Standard Time)

Hi @thesayyn, I'm focused on some other issues at the moment, so if you are in a rush, I'd make other plans. This is still being considered, but I haven't come to a decision yet, and don't want to hold you up.

Sahin Yort · Answer 18 · Wed May 01 2024 03:36:36 GMT+0800 (China Standard Time)

I see, thanks!

Brandon Mitchell · Answer 19 · Mon May 13 2024 03:44:59 GMT+0800 (China Standard Time)

> My most compelling argument would be that this feature complements #721 solution 2, where under Bazel the digest for each layer is calculated out-of-band using this feature.

Since the solution to #721 focused on the first option, where users do not provide the digests, and the input must be a tar file (not a compressed tar), is this feature still needed? Looking back through the discussion, I'm inclined to suggest that either the tar file is provided and regctl computes everything (with regctl image mod --layer-add...), or the tooling manages the layers, config, and manifests directly (with regctl manifest put and regctl blob put).

Trying to implement the middle ground where layers and their digests are managed externally with regctl allowing a mod of the image trusting those values feels too error prone to me. I'd worry about issues raised by users trying to use the feature without understanding the difference between a layer diffid and a blob digest.

Sahin Yort · Answer 20 · Mon May 13 2024 04:28:35 GMT+0800 (China Standard Time)

I understand. I implemented this with custom tooling combining jq, regctl, and zstd. I no longer need this.