image promotion is very slow despite low resource utilization
BenTheElder opened this issue
When running in Kubernetes CI, image promotion takes more than one hour, often multiple hours.
Worse, it takes an hour just to run the presubmit tests in the k8s.io repo. As a result, a quickly approved image promotion PR may take 3 hours or more to take effect.
I took a cursory pass at this with kubernetes/test-infra#27743 + kubernetes/test-infra#27765, which set large (7 core, 40 GiB RAM) allocations, configured GOMAXPROCS to match, and increased --threads (worker goroutines).
Since those changes, we can follow from a running job, to its pod, to the node the pod is on, to the backing GCE VM.
On that VM there should only be the kubelet and system daemons; other workloads will not be scheduled there because essentially no schedulable resources remain (the VM has 8 cores, we request 7, and some are reserved for system agents).
We can see that CPU usage spikes when the job starts (git cloning?) and then the entire machine settles to just 3-7% utilization, with outgoing connections at only ~2/s to ~3/s, disk write < 0.6 MiB/s, and network egress < 0.7 MiB/s, all far below the initial spike.
This implies we are bottlenecking heavily on something like waiting for the sigstore API ...?
Copying images could otherwise conceivably bottleneck on CPU or Network but both of those are hardly in use.
I half-joked about synchronized logging and the volume of logs, but I don't seriously think it's contention over the logger.
Besides investigating what exactly is causing this slow-down, I think we should consider running N postsubmit jobs, each responsible for a subset of the images and only triggered when the respective manifest directories change (based on run_if_changed), plus some option to the image promoter to specify which manifests to read.
/cc @kubernetes-sigs/release-engineering
OK, we have timestamps: https://prow.k8s.io/log?job=post-promo-tools-image-promo-canary&id=1582627370186051584
curl 'https://prow.k8s.io/log?job=post-promo-tools-image-promo-canary&id=1582627370186051584' | grep 'kpromo\['
level=info msg="kpromo[1666162785]: PromoteImages start"
level=info msg="kpromo[1666162785]: Parsing manifests"
level=info msg="kpromo[1666162785]: Creating sync context manifests"
level=info msg="kpromo[1666162785]: Getting promotion edges"
level=info msg="kpromo[1666162786]: Creating producer function"
level=info msg="kpromo[1666162786]: Validating staging signatures"
level=info msg="kpromo[1666162806]: Promoting images"
level=info msg="kpromo[1666162818]: Replicating signatures"
level=info msg="kpromo[1666162820]: Signing images"
level=info msg="kpromo[1666162842]: Finish"
A note about these times: they are from a run of our canary job, which is a test promotion using a canary image of the promoter built at head. The canary job promotes an image to two mirrors that have almost nothing in them, but we can still see the latency of the promotion plus the signing operations:
- Promoting the image takes 12 seconds
- Signing and verifying operations:
  - Validating signatures from staging: 20 secs
  - Replication of staging signatures: 2 secs
  - Signing, which includes replication to the mirrors: 22 secs
A few notes here:
- #1 seems unusually long and is worth investigating further, especially since there are no signatures in the staging project. If it still takes this long, it may be a bug and a good target to fix now.
- #2 is quick in this example because there are no signed images in the staging repo. If there were, replicating the signatures effectively means doing another image promotion, which takes time.
- #3 (signing) is the slowest part (but see the previous two notes); it implies talking to sigstore and pushing the signature to the registry. This step includes not just signing but also replicating the signature "image" from the first mirror to the rest of them.
Looking at the issue "Validating signatures from staging: 20 secs" mentioned in the comment above: the method we measure is ValidateStagingSignatures (promo-tools/internal/promoter/image/sign.go, lines 123 to 125 at b4fc93c).
It's suspicious that we create a signer in ValidateStagingSignatures as well as in FindSingedEdges; especially the multiple uses of the throttlers in combination with the signing API are kind of hard to debug (promo-tools/internal/promoter/image/sign.go, lines 74 to 91 at b4fc93c).
I propose to start with some more verbose logging and tracing of the timestamps between log messages in #640.
signer.IsImageSigned is indeed slow (~1s per image). I was able to trace the path as follows:
- IsImageSigned
- (k/release-sdk) SignedEntity
- (k/release-sdk) SignedEntity
- (cosign) remoteGet
- (cosign) Get
- (go-containerregistry) get
- (go-containerregistry) makeFetcher
- (go-containerregistry) makeFetcher
- (go-containerregistry) NewWithContext
- (go-containerregistry)
That's where it gets interesting:
- The ping definitely takes some time
- Another time consumer is the bt.refresh() call
Test case code
package main

import (
    "sync"
    "time"

    "github.com/sirupsen/logrus"
    "sigs.k8s.io/release-sdk/sign"
)

type Hook struct {
    lastTime time.Time
    mu       sync.RWMutex
}

func NewHook() *Hook {
    return &Hook{
        lastTime: time.Now(),
        mu:       sync.RWMutex{},
    }
}

func (h *Hook) Fire(e *logrus.Entry) error {
    h.mu.Lock()
    e.Data["diff"] = e.Time.Sub(h.lastTime).Round(time.Millisecond)
    h.lastTime = e.Time
    h.mu.Unlock()
    return nil
}

func (h *Hook) Levels() []logrus.Level {
    return logrus.AllLevels
}

func main() {
    logrus.SetFormatter(&logrus.TextFormatter{
        DisableTimestamp: false,
        FullTimestamp:    true,
        TimestampFormat:  "15:04:05.000",
    })
    logrus.AddHook(NewHook())

    signer := sign.New(sign.Default())

    const img = "docker.io/ubuntu:22.04"

    logrus.Info("Check if image is signed")
    signed, err := signer.IsImageSigned(img)
    if err != nil {
        panic(err)
    }
    logrus.Infof("Is signed: %v", signed)
}
I added some traces around that logic in NewWithContext in the same way as introduced in #640, with the output:
INFO[11:53:37.348] Check if image is signed diff=0s
INFO[11:53:37.349] NewWithContext start diff=0s
INFO[11:53:37.349] ping start diff=0s
INFO[11:53:37.810] ping end diff=462ms
INFO[11:53:37.810] bt.refresh start diff=0s
INFO[11:53:38.283] bt.refresh end diff=473ms
INFO[11:53:38.283] NewWithContext end diff=0s
INFO[11:53:38.450] NewWithContext start diff=167ms
INFO[11:53:38.450] ping start diff=0s
INFO[11:53:38.564] ping end diff=113ms
INFO[11:53:38.564] bt.refresh start diff=0s
INFO[11:53:38.704] bt.refresh end diff=140ms
INFO[11:53:38.704] NewWithContext end diff=0s
INFO[11:53:38.839] Is signed: false diff=135ms
The first call to NewWithContext seems to be the critical path. I created a tracking issue in go-containerregistry about that topic: google/go-containerregistry#1466
It makes sense that the first call to set up the transport would need to negotiate auth with the specific registry.
On mobile: are we able to supply and reuse transports on our end? We should probably set up one per worker goroutine and reuse it.
See also kubernetes-sigs/release-sdk#105
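For illustration, a minimal sketch of what reusing one transport could look like; it assumes go-containerregistry's remote.WithTransport option and a HEAD-only lookup, and is not the promoter's actual wiring:

```go
package main

import (
    "fmt"
    "net/http"

    "github.com/google/go-containerregistry/pkg/authn"
    "github.com/google/go-containerregistry/pkg/name"
    "github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
    // One shared transport: the connection pool and TLS sessions are reused
    // across all lookups instead of being re-established per call.
    tr := http.DefaultTransport.(*http.Transport).Clone()

    for _, img := range []string{
        "registry.k8s.io/pause:3.8",
        "registry.k8s.io/pause:3.9",
    } {
        ref, err := name.ParseReference(img)
        if err != nil {
            panic(err)
        }
        // HEAD the manifest only; WithTransport plugs in the shared RoundTripper.
        desc, err := remote.Head(ref,
            remote.WithTransport(tr),
            remote.WithAuthFromKeychain(authn.DefaultKeychain),
        )
        if err != nil {
            panic(err)
        }
        fmt.Println(img, desc.Digest)
    }
}
```

Even with a shared transport, the library still builds its own per-registry transport (the ping plus token exchange traced above), which appears to be what the tracking issue is about.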
Looks like up through:
We could plumb a WithTransport() (google/go-containerregistry#1466 (comment)) but not:
Working on a faster version of image signature verification in kubernetes-sigs/release-sdk#123
We also need a better release-sdk API for the usage of VerifyImage (promo-tools/internal/promoter/image/sign.go, lines 135 to 157 at 289d224).
The method again checks if the image is signed (not required in our case) and also does not reuse the transport.
Edit: Some ideas are now in kubernetes-sigs/release-sdk#124
I think one of the recent changes increased the number of failed runs:
https://prow.k8s.io/?job=*image-promo
We have failures like this:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1583138923981312000
time="17:17:43.175" level=fatal msg="run
cip run
: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fe2e-test-images%2Fvolume%2Frbd%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=97ms
IIRC per Jon the quota is pretty reasonable (I'm forgetting the hard number; I asked).
https://cloud.google.com/artifact-registry/quotas
60000 requests per minute in each region or multi-region.
1000 qps per region ... what are we doing that is hitting > 1000 qps in a region?
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo
besides the sample above:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582921237355565056
time="03:02:13.916" level=fatal msg="run
cip run
: filtering edges: filtering promotion edges: reading registries: getting tag list: Get "https://us-central1-docker.pkg.dev/v2/\": read tcp 10.4.2.155:33274->142.250.152.82:443: read: connection reset by peer" diff=2ms
We just need to tolerate connection blips probably, in that case.
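A minimal sketch of the kind of tolerance meant here (illustrative only, not what cip currently does): retry a read a few times with backoff when the failure is a transient network error, and fail immediately otherwise.

```go
package main

import (
    "errors"
    "fmt"
    "net"
    "time"
)

// withRetry retries fn on transient network errors (e.g. connection reset),
// backing off 1s, 2s, 4s, ... between attempts. Non-network errors (such as
// quota responses) are returned immediately without retrying.
func withRetry(attempts int, fn func() error) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = fn(); err == nil {
            return nil
        }
        var netErr net.Error
        if !errors.As(err, &netErr) {
            return err // not a network blip; don't retry
        }
        time.Sleep(time.Duration(1<<i) * time.Second)
    }
    return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
    // Example: an operation that fails once with a (simulated) network error.
    calls := 0
    err := withRetry(3, func() error {
        calls++
        if calls == 1 {
            return &net.OpError{Op: "read", Err: errors.New("connection reset by peer")}
        }
        return nil
    })
    fmt.Println("calls:", calls, "err:", err)
}
```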
yesterday:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582781607704530944
level=fatal msg="run
cip run
: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fexperimental%2Fconformance-ppc64le%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'."
before that, last Tuesday:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582405078290010112
level=fatal msg="run cip run
: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fautoscaling%2Fvpa-recommender-s390x%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'."
us-east1 again
The promoter will look for every image, in every mirror, to see how the target registry compares to the source. I think that is it: we have a ton of images, and the new mirrors multiplied those lookups about 9x, which is why the logs are now humongous too.
This quota is per region though, so we have to be exceeding 1000 qps in one region
registry.k8s.io users are consuming a non-zero amount, and that will increase with time, but in us-east1 it's currently only ~10 qps even reaching registry.k8s.io and not all of that will reach AR.
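One idea worth sketching (illustrative only, not something cip does today): cap the promoter's own read rate client-side so a full re-sync stays below the per-region quota, e.g. with golang.org/x/time/rate.

```go
package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // Allow at most 500 registry reads per second with a small burst,
    // comfortably under the documented 1000 QPS per-region limit.
    limiter := rate.NewLimiter(rate.Limit(500), 50)

    ctx := context.Background()
    start := time.Now()
    for i := 0; i < 1000; i++ {
        if err := limiter.Wait(ctx); err != nil {
            panic(err)
        }
        // ... issue one tag-list / manifest request here ...
    }
    fmt.Printf("paced 1000 simulated requests over %v\n", time.Since(start))
}
```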
Yeah, I assume we contact the registry multiple times per image, which should clearly be improved. But, for example, checking if a signature exists already needs 2 requests: one for the digest of the main image and one for the signature digest.
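To make that concrete, here is a rough sketch of those two lookups using go-containerregistry directly and cosign's sha256-&lt;digest&gt;.sig tag convention; the example image and the manual tag construction are assumptions for illustration, not the promoter's code:

```go
package main

import (
    "fmt"
    "strings"

    "github.com/google/go-containerregistry/pkg/name"
    "github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
    // Hypothetical example image; any public reference works here.
    ref, err := name.ParseReference("registry.k8s.io/pause:3.9")
    if err != nil {
        panic(err)
    }

    // Request 1: resolve the digest of the main image.
    desc, err := remote.Head(ref)
    if err != nil {
        panic(err)
    }

    // Request 2: check whether the signature "image" cosign would have
    // stored next to it (tag sha256-<hex>.sig) exists.
    sigTag := fmt.Sprintf("%s:%s.sig", ref.Context().Name(),
        strings.ReplaceAll(desc.Digest.String(), ":", "-"))
    sigRef, err := name.ParseReference(sigTag)
    if err != nil {
        panic(err)
    }
    if _, err := remote.Head(sigRef); err != nil {
        fmt.Println("no signature found (or lookup failed):", err)
        return
    }
    fmt.Println("signature present at", sigTag)
}
```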
Some enhancements regarding the signature verification got merged in release-sdk and promo tools. I still have some ideas how to improve the signing part.
Amazing, I will kick off a canary run later today and we can see it in action. Thank you Sascha, see you soon!
I think we can close this now, especially since we have #662. There could be more room for improvement, like reducing the number of registries.
We should still check for effective resource utilization and either pursue further tuning or go ahead and reduce the resources configured for the job, as we request a whole node but are barely using it.
Good, can you provide us another set of data for, let's say, the patch promotions next week?
Can get a current sample as soon as I'm back at my desk, and thank you for working on this
Sorry .... today did not turn out as expected, I should have known.
So anyhow:
https://prow.k8s.io/?job=*image-promo
=> current run =>
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1588044493649612800
=> prowjob yaml link =>
status.pod_name in the prowjob gives us 9fb02c3f-5b3a-11ed-a031-bac5f1dea348
=> let's go check cloud console and find that pod in the k8s infra cluster =>
https://console.cloud.google.com/kubernetes/list/overview?mods=logs_tg_staging&project=k8s-infra-prow-build-trusted
=> workloads tab, filter by 9fb02c3f-5b3a-11ed-a031-bac5f1dea348
(could instead auth to the cluster and get the pod with kubectl) =>
https://console.cloud.google.com/kubernetes/pod/us-central1/prow-build-trusted/test-pods/9fb02c3f-5b3a-11ed-a031-bac5f1dea348/details?mods=logs_tg_staging&project=k8s-infra-prow-build-trusted&pageState=(%22savedViews%22:(%22i%22:%22073c6a714dc44e0d94ce168486658f14%22,%22c%22:%5B%5D,%22n%22:%5B%5D))
=> Node gke-prow-build-trust-trusted-pool1-20-4044b2cc-vqho, let's go find the VM by that name, paste 20-4044b2cc-vqho
in the search box up top, click on the VM result =>
https://console.cloud.google.com/compute/instancesDetail/zones/us-central1-b/instances/gke-prow-build-trust-trusted-pool1-20-4044b2cc-vqho?q=search&referrer=search&project=k8s-infra-prow-build-trusted&mods=logs_tg_staging
Now if we look at the observability tab:
We can take this shortcut because we know there aren't other workloads on this machine besides the system agents, given the job / pod resource requests.
It looks like we still don't use a lot, so we should probably consider turning back down the changes I made to request lots of resources for this pod.
In the meantime we're getting quick push-after-promo-PR runs and quick promo presubmit testing thanks to @saschagrunert's changes to leverage the CI commit details, so I don't think it's pressing to get a super fast 8-core full re-sync yet.
(note that that spike is around when the job started, and then it settles down to around 4% CPU and 0.66MiB/s outbound traffic)
(also note that the CPU utilization % metric is relative to the total vCPUs on the VM, not to one core)
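(For scale: 4% of this 8-vCPU VM is roughly 0.3 cores of actual use, against the 7 cores the job requests.)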
@BenTheElder agree, so would it be enough to change the resource requests/limits to match something lower?
What about the memory utilization?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.