[Feature] Add a new flag for regctl digest to decompress
thesayyn opened this issue · comments
Current Behavior
I realized there is a hidden command in regctl, `regctl digest`, which is really useful for calculating hashes for blobs. However, it can't be used for calculating the diffid, which is the digest of the decompressed blob.
Expected Behavior
A new flag for `regctl digest`, `--decompress=gzip,lzma,auto`, which decompresses the blob using gzip or xz before calculating the digest.
I am happy to contribute this, if there's no concern @sudo-bmitch
Hidden commands tend to be unsupported, experimental, or temporary. In this case, my plan was to remove it from `regctl` for a while since I've just used `sha256sum` instead. And for decompressing, that would end up being a command like `gunzip -c $file | sha256sum -`. For your use case, would an implementation of #721 remove the need for this?
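To make the distinction concrete, here is a minimal sketch of that pipeline next to a plain hash of the compressed file. It assumes `gzip`, `gunzip`, and `sha256sum` are on the PATH; the file names and layer content are made up for illustration, and the stand-in "layer" is not a real tar archive.

```shell
#!/bin/sh
# Sketch: blob digest vs. diffid for one gzip-compressed layer.
# Assumes gzip, gunzip, and sha256sum are on PATH.
set -eu

workdir=$(mktemp -d)

# Stand-in for an uncompressed layer; a real layer would be a tar archive.
printf 'hello layer\n' > "$workdir/layer.tar"
# -n omits the file name and timestamp from the gzip header.
gzip -n -c "$workdir/layer.tar" > "$workdir/layer.tar.gz"

# Blob digest: hash of the compressed bytes as they are stored.
digest="sha256:$(sha256sum < "$workdir/layer.tar.gz" | cut -d' ' -f1)"

# diffid: hash of the bytes after decompression.
diffid="sha256:$(gunzip -c "$workdir/layer.tar.gz" | sha256sum | cut -d' ' -f1)"

echo "digest: $digest"
echo "diffid: $diffid"
```

The two values differ because one hashes the compressed bytes and the other hashes the decompressed bytes.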
I understand, and I agree that alternatives exist like you said, but under Bazel that means we have to fetch more binaries to have `gunzip`, `shasum`, and `zstdcat`, which is a lot more cumbersome than running `regctl digest`.
We have to do things like pre-building pigz for faster build times: https://github.com/thesayyn/pigz-prebuilt/releases/tag/v2.8
I'll admit that this is making my life easier, in return for a little bit of code in here. My most compelling argument would be that this feature will complement #721 solution 2, where under Bazel the digest for each layer is calculated out-of-band using this feature.
#728 is already implementing this, maybe you could take a look and see if it's too much to accept.
As for moving this command, if you decide to keep it, I believe this could be called `regctl blob calculate-digest`.
> I understand, and I agree that alternatives exist like you said, but under Bazel that means we have to fetch more binaries to have `gunzip`, `shasum`, and `zstdcat`, which is a lot more cumbersome than running `regctl digest`. We have to do things like pre-building pigz for faster build times: https://github.com/thesayyn/pigz-prebuilt/releases/tag/v2.8

Given some of your use cases, I'm surprised you didn't opt to make your own binary. It could still leverage a library like regclient, but give you the ability to add exactly the features you need without including a bunch of out of scope code. That's the direction a bunch of others have gone: https://github.com/regclient/regclient/network/dependents
Oh, we are really trying hard not to create our own tooling, for various reasons. One of them is that what we are doing in rules_oci isn't specific to Bazel; what we are doing is the same as what everyone else is doing, so we must use off-the-shelf tooling.
The principle is: Bazel is a build tool that should run existing tools without a significant behavior difference.
Since you mentioned out of scope, do you think this change is out of scope?
My perception was that diffid calculation is a pretty big part of container images. It is the most expensive part of container assembly.
I'd lean towards saying the `regctl digest` command isn't a core competency of `regctl`, and so I hesitate to have a dependency on it since it could be removed in a future release. The one place I can think of that computes the diffid is `regctl image mod`. For the rest of the regctl commands, it uses the digest on the content for validation and addressability, which doesn't need the diffid, only the digest or descriptor of the blob or manifest.
None of this is a hard no, just a hesitation to say yes knowing that it adds a new feature to maintain that hasn't been needed by other users.
I'm still fuzzy on why bazel wants to precompute the digest of a new layer, versus letting the tooling compute it during the `regctl image mod` command. One advantage of letting regctl compute the digests is that both can be computed simultaneously when reading and compressing the content. Is there a requirement to verify the content wasn't tampered with when being passed to regctl, is there a need to extract the digests for logging, or something else?
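As a rough illustration of computing both values in one pass, the sketch below reads the uncompressed content once, hashing it for the diffid while the same stream is compressed and hashed for the blob digest. This is not how regctl implements it; it is a shell approximation that assumes `mkfifo`, `tee`, `gzip`, and `sha256sum` are available, with made-up file names and content.

```shell
#!/bin/sh
# Sketch: compute diffid and blob digest in a single read of the uncompressed layer.
# Assumes mkfifo, tee, gzip, and sha256sum are on PATH.
set -eu

workdir=$(mktemp -d)
printf 'hello layer\n' > "$workdir/layer.tar"   # stand-in for a real tar archive

mkfifo "$workdir/raw"

# Reader side: hash the uncompressed bytes arriving on the named pipe.
sha256sum < "$workdir/raw" | cut -d' ' -f1 > "$workdir/layer.diffid" &

# Writer side: tee copies each chunk to the pipe while gzip compresses the same
# stream; the second tee stores the compressed blob while sha256sum hashes it.
tee "$workdir/raw" < "$workdir/layer.tar" \
  | gzip -n -c \
  | tee "$workdir/layer.tar.gz" \
  | sha256sum | cut -d' ' -f1 > "$workdir/layer.digest"

wait  # let the background hash finish before reading its result

echo "diffid: sha256:$(cat "$workdir/layer.diffid")"
echo "digest: sha256:$(cat "$workdir/layer.digest")"
```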
> I'm still fuzzy on why bazel wants to precompute the digest of a new layer, versus letting the tooling compute it during the `regctl image mod` command.

It could be done either way, but calculating it before running `regctl image mod` has a few advantages over calculating it during `regctl image mod`.
TLDR:
1- Incrementality: doing the diffid/digest calculation as part of `regctl image mod` will yield poor results on incremental changes to the layer because it will have to redo all the work for unchanged layers as well.
2- Cacheability: the diffid/digest calculation for each layer can be cached and redone only when that layer changes.
Let me explain with an example; forget about Bazel for a moment. Let's say we are creating an image:

- It has 8 layers, and each layer's diffid (`5s`) and digest (`5s`) calculation takes `10s` in total
- It uses an empty base image; assume this takes 0 seconds for simplicity
- Running on 8 cores

- First approach, where we calculate the diffid as part of `regctl image mod` and we optimized it to run on all 8 cores
  - Run `regctl image mod --layer=1.tar.gz ... --layer=10.tar.gz` -> `10s` total
  - Change the fifth layer
  - Run `regctl image mod --layer=1.tar.gz ... --layer=10.tar.gz` -> `10s` total again, because all the work has to be done again.

Whole workflow took `20s`, `10s` for each invocation.

- Second approach, where we calculate the diffid out-of-band, computed in parallel on 8 cores
  - Run `regctl digest --decompress < N.tar.gz` for each layer, write to `N.diffid` -> `5s` total
  - Run `regctl digest < N.tar.gz` for each layer, write to `N.digest` -> `5s` total
  - Run `regctl image mod --layer=$(cat N.diffid)=$(cat N.digest)=N.tar.gz` -> `0s` (everything is already calculated)
  - Change the fifth layer
  - Run `regctl digest --decompress < 5.tar.gz` and `regctl digest < 5.tar.gz` -> `5s` (computed diffid and digest in parallel)
  - Run `regctl image mod --layer=$(cat N.diffid)=$(cat N.digest)=N.tar.gz` -> `0s` (everything is already calculated)

Whole workflow took `15s`, `10s` for the first invocation, and `5s` for the second.

These numbers are made up, just to give you an idea; in reality the numbers are far worse for the first option due to IO pressure from having to redo all the work.
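The out-of-band step of the second approach could look roughly like the sketch below. Since `regctl digest --decompress` is only a proposal in this thread, it uses `gunzip` and `sha256sum` as stand-ins; the three tiny layers and their file names are hypothetical, and deciding which layers need recomputing is left to the build system.

```shell
#!/bin/sh
# Sketch of the out-of-band step using gunzip and sha256sum as stand-ins,
# since `regctl digest --decompress` is only a proposal in this thread.
set -eu

workdir=$(mktemp -d)

# Three small stand-in layers (hypothetical content, not real tar archives).
for n in 1 2 3; do
  printf 'layer %s\n' "$n" | gzip -n -c > "$workdir/$n.tar.gz"
done

# Start one job per output file so independent hashes can run on separate cores.
for n in 1 2 3; do
  layer="$workdir/$n.tar.gz"
  sha256sum < "$layer" | cut -d' ' -f1 > "$workdir/$n.digest" &
  gunzip -c "$layer" | sha256sum | cut -d' ' -f1 > "$workdir/$n.diffid" &
done
wait  # block until every background hash has finished

for n in 1 2 3; do
  echo "$n.tar.gz digest=sha256:$(cat "$workdir/$n.digest") diffid=sha256:$(cat "$workdir/$n.diffid")"
done
```

When a single layer changes, only that layer's `.digest` and `.diffid` files need to be regenerated; the files for the other layers stay valid.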
Hope the use case is clearer now.
> None of this is a hard no, just a hesitation to say yes knowing that it adds a new feature to maintain that hasn't been needed by other users.

My use of regctl is a little more nuanced than a usual user's, to be frank. I was hoping you'd see it being useful for some other use cases; more specifically, I thought it would be useful for people dealing with OCI artifacts.
All that said, no hard feelings if you don't feel like regctl should have something like this. (I understand, as a maintainer.)
> 1- Incrementality: doing the diffid/digest calculation as part of `regctl image mod` will yield poor results on incremental changes to the layer because it will have to redo all the work for unchanged layers as well.

I don't understand this. The digest of the other layers has not changed, so you would only need to compute the digest of the layers that have changed. I think there is some logic in Docker to chain metadata from the various steps of the build, but that doesn't exist in OCI images or in buildkit to the best of my knowledge. Each layer can be treated as an independent entity, and only when they are assembled with the overlay filesystem will the effect of one layer to another be seen.
This has a little to do with the build system: what happens to the result of `regctl image mod`? If the result of the previous run is thrown away, then there's no way it can know what has changed unless it looks into the layers and sees whether they already exist.
Bazel's incrementality model is built on actions, and an action is basically `inputs + tools = output`, so any time any of the inputs change, the command is invoked again, the previous output artifact is thrown away, and the new one is stored in the cache.
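A toy model of that rule, for illustration only: the sketch below derives a cache key from the command line plus the content hash of each input and reruns the command only when the key changes. It is not Bazel's implementation; `action_key`, `run_action`, and the cache layout are hypothetical names invented for this example.

```shell
#!/bin/sh
# Toy model of "inputs + tools = output": rerun an action only when its key changes.
# Not Bazel's implementation; all names here are hypothetical.
set -eu

cache_dir=$(mktemp -d)

# action_key CMD INPUT...: hash of the command line plus the content hash of every input.
action_key() {
  cmd=$1
  shift
  { printf '%s\n' "$cmd"; sha256sum "$@"; } | sha256sum | cut -d' ' -f1
}

# run_action CMD INPUT...: prints "hit" when a cached output exists, else runs CMD and prints "miss".
run_action() {
  key=$(action_key "$@")
  if [ -f "$cache_dir/$key" ]; then
    echo "hit"
  else
    sh -c "$1" > "$cache_dir/$key"
    echo "miss"
  fi
}

input=$(mktemp)
printf 'v1\n' > "$input"
first=$(run_action "sha256sum < $input" "$input")    # nothing cached yet
second=$(run_action "sha256sum < $input" "$input")   # same command, same input content
printf 'v2\n' > "$input"
third=$(run_action "sha256sum < $input" "$input")    # input content changed

echo "$first $second $third"
```

Under this model, an action that takes every layer as input reruns in full when any single layer changes, which is why splitting the hashing into one action per layer helps.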
> This has a little to do with the build system: what happens to the result of `regctl image mod`? If the result of the previous run is thrown away, then there's no way it can know what has changed unless it looks into the layers and sees whether they already exist.

Do you have more details on this process? You can't both throw away an image and mod the image simultaneously. Are you trying to maintain a build cache outside of the repository? If that's the case, I think building the manifest and pushing the blobs directly from your tooling would have a better experience. A `regctl image mod` command will likely recompute the digests since the use case is for users that don't want to construct the image and manage the DAG themselves.
> You can't both throw away an image and mod the image simultaneously.

You are right, the previously modded image gets thrown away, and the flow I described above gets executed from scratch. Bazel describes this well here: https://bazel.build/basics/artifact-based-builds.

> Are you trying to maintain a build cache outside of the repository

Yes, ocidir is effectively a cache, but subsequent builds don't have access to the prior ocidir.

> If that's the case, I think building the manifest and pushing the blobs directly from your tooling would have a better experience.

Under rules_oci, building and pushing happen separately; we store everything in an ocidir, which is fast.

> A `regctl image mod` command will likely recompute the digests since the use case is for users that don't want to construct the image and manage the DAG themselves.

That's why I proposed two flags to `regctl image mod`: `regctl image mod --layer=layer.tar.gz`, which computes the diffid/digest as part of `regctl image mod`, and `regctl image mod --layer=<diffid>,<digest>,layer.tar.gz`, which just appends a new descriptor to the layers array and calls it a day.
This flag is more important to me than `regctl image mod --layer`, though.
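For context on what "appends a new descriptor" involves, the sketch below prints the two entries a precomputed layer would contribute. The descriptor fields (`mediaType`, `digest`, `size`) and the config's `rootfs.diff_ids` array come from the OCI image spec; the file name and content are hypothetical, and nothing here reflects how regctl would actually accept or apply these values.

```shell
#!/bin/sh
# Sketch: the values a precomputed layer entry would carry.
# Field names follow the OCI image spec; file names and content are hypothetical.
set -eu

workdir=$(mktemp -d)
printf 'hello layer\n' | gzip -n -c > "$workdir/layer.tar.gz"

digest="sha256:$(sha256sum < "$workdir/layer.tar.gz" | cut -d' ' -f1)"
diffid="sha256:$(gunzip -c "$workdir/layer.tar.gz" | sha256sum | cut -d' ' -f1)"
size=$(( $(wc -c < "$workdir/layer.tar.gz") ))

# Entry for the manifest's "layers" array: addresses the compressed blob.
descriptor=$(printf '{"mediaType":"application/vnd.oci.image.layer.v1.tar+gzip","digest":"%s","size":%s}' "$digest" "$size")

# Entry for the config's "rootfs.diff_ids" array: the hash of the uncompressed content.
diff_id_entry=$(printf '"%s"' "$diffid")

echo "manifest layers[] += $descriptor"
echo "config rootfs.diff_ids[] += $diff_id_entry"
```

Because the config is itself a blob referenced by the manifest, adding a diff_id changes the config's digest, so the manifest has to be updated for both the new layer and the new config.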
Hope it's more clear what i am trying to do now.
Sorry to be a little pushy, but do you think #728 can land? If not, I have to find another way to do it.
Hi @thesayyn, I'm focused on some other issues at the moment, so if you are in a rush, I'd make other plans. This is still being considered, but I haven't come to a decision yet, and don't want to hold you up.
I see thanks!
> My most compelling argument would be that this feature will complement #721 solution 2, where under Bazel the digest for each layer is calculated out-of-band using this feature.

Since the solution to #721 focused on the first option, where users do not provide the digests, and the input must be a tar file (not a compressed tar), is this feature still needed? Looking back through the discussion, I'm inclined to suggest that either the tar file is provided and regctl computes everything (with `regctl image mod --layer-add...`), or the tooling manages the layers, config, and manifests directly (with `regctl manifest put` and `regctl blob put`).
Trying to implement the middle ground where layers and their digests are managed externally with regctl allowing a mod of the image trusting those values feels too error prone to me. I'd worry about issues raised by users trying to use the feature without understanding the difference between a layer diffid and a blob digest.
I understand. I did implement this with custom tooling combining jq + regctl and zstd. I no longer need this.