kubernetes-sigs / promo-tools

Container and file artifact promotion tooling for the Kubernetes project

image promotion is very slow despite low resource utilization

BenTheElder opened this issue · comments

When running in Kubernetes CI image promotion takes more than one hour, often multiple hours.

Worse, it takes an hour just to run the presubmit tests in the k8s.io repo. As a result, a quickly approved image promotion PR may take 3 hours or more to take effect.

I took a cursory pass at this with kubernetes/test-infra#27743 + kubernetes/test-infra#27765, which set large (7 core, 40 Gi RAM) allocations, configured GOMAXPROCS to match, and increased --threads (worker goroutines).

Since those changes, we can follow from a running job, to its pod, to the node the pod is on, to the backing GCE VM.
On that VM there should only be kubelet / system daemons; other workloads will not be scheduled there because essentially no schedulable resources remain (the VM has 8 cores, we request 7, and some are reserved for system agents).

We can see that CPU usage spikes when the job starts (git cloning?) and then the entire machine settles to just 3-7% utilization, with only ~2-3 outgoing connections/s, disk writes < 0.6 MiB/s, and network egress < 0.7 MiB/s, far below the initial spike:
(Screenshot: VM CPU, network, and disk utilization, 2022-10-18)

This implies we are bottlenecking heavily on something like waiting for the sigstore API ...?

Copying images could otherwise conceivably bottleneck on CPU or network, but both of those are hardly in use.

I half-joked about synchronized logging and the volume of logs, but I don't seriously think it's contention over the logger.

Besides investigating what exactly is causing this slowdown, I think we should consider running N postsubmit jobs, each responsible for a subset of the images and triggered only when the respective manifest directories change, based on run_if_changed plus some option to the image promoter to specify which manifests to read.

Opened #638 to log timestamps before each step of the promotion process.
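For reference, the step log format is just the step name prefixed with the current Unix timestamp. A minimal sketch of that kind of step logging with logrus (the logStep helper name is hypothetical; the actual change lives in #638):

package main

import (
	"time"

	"github.com/sirupsen/logrus"
)

// logStep emits lines like `kpromo[1666162785]: PromoteImages start`,
// matching the job output shown below.
func logStep(step string) {
	logrus.Infof("kpromo[%d]: %s", time.Now().Unix(), step)
}

func main() {
	logStep("PromoteImages start")
	logStep("Parsing manifests")
	// ... one call per promotion step ...
	logStep("Finish")
}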

/cc @kubernetes-sigs/release-engineering

OK, we have timestamps: https://prow.k8s.io/log?job=post-promo-tools-image-promo-canary&id=1582627370186051584

curl 'https://prow.k8s.io/log?job=post-promo-tools-image-promo-canary&id=1582627370186051584' | grep 'kpromo\['

level=info msg="kpromo[1666162785]: PromoteImages start"
level=info msg="kpromo[1666162785]: Parsing manifests"
level=info msg="kpromo[1666162785]: Creating sync context manifests"
level=info msg="kpromo[1666162785]: Getting promotion edges"
level=info msg="kpromo[1666162786]: Creating producer function"
level=info msg="kpromo[1666162786]: Validating staging signatures"
level=info msg="kpromo[1666162806]: Promoting images"
level=info msg="kpromo[1666162818]: Replicating signatures"
level=info msg="kpromo[1666162820]: Signing images"
level=info msg="kpromo[1666162842]: Finish"

A note about these times: they are from a run of our canary job, which is a test promotion using a canary image of the promoter built at head. The canary job promotes an image to two mirrors that have almost nothing in them, but we can still see the latency of the promotion plus the signing operations:

  • Promoting the image takes 12 seconds
  • Signing and verifying operations:
  1. Validating signatures from staging: 20 secs
  2. Replication of staging signatures: 2 secs
  3. Signing, which includes replication to the mirrors: 22 secs

A few notes here:

  • #1 seems unusually long and is worth investigating further, especially since there are no signatures in the staging project. If it still takes this long, it may be a bug and a good target to fix now.
  • #2 is quick in this example because there are no signed images in the staging repo. If there were, replicating the signatures would effectively mean doing another image promotion, which takes time.
  • #3 (signing) is the slowest part (but see the previous two notes); it implies talking to sigstore and pushing the signature to the registry. This step includes not just signing but also replicating the signature "image" from the first mirror to the rest of them.

Looking at the "Validating signatures from staging: 20 secs" issue mentioned in the comment above: the method we measure is ValidateStagingSignatures:

func (di *DefaultPromoterImplementation) ValidateStagingSignatures(
	edges map[reg.PromotionEdge]interface{},
) (map[reg.PromotionEdge]interface{}, error) {

It's suspicious that we create a signer in ValidateStagingSignatures as well as in FindSignedEdges; especially the multiple uses of the throttlers in combination with the signing API are hard to debug:

signer := sign.New(sign.Default())

// Check if we've verified this edge before
seenEdges.RLock()
// ... if we have, we can skip it.
if _, ok := seenEdges.list[edge.SrcReference()]; ok {
	seenEdges.RUnlock()
	t.Done(nil)
	return
}
seenEdges.RUnlock()

// If not, add it to the list
seenEdges.Lock()
seenEdges.list[edge.SrcReference()] = struct{}{}
seenEdges.Unlock()

isSigned, err := signer.IsImageSigned(edge.SrcReference())
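As a side note, the RLock check followed by a separate Lock insert above leaves a small window where two workers can both miss the cache for the same reference and both call IsImageSigned. A self-contained sketch of doing the check and insert in one critical section (the seenCache type and image references here are purely illustrative, not the promoter's actual code):

package main

import (
	"fmt"
	"sync"
)

// seenCache records which references have already been checked. Doing the
// lookup and the insert under one lock avoids the race where two goroutines
// both pass a read-locked check before either records the reference.
type seenCache struct {
	mu   sync.Mutex
	list map[string]struct{}
}

// markSeen reports whether ref was already recorded, recording it if not.
func (c *seenCache) markSeen(ref string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.list[ref]; ok {
		return true
	}
	c.list[ref] = struct{}{}
	return false
}

func main() {
	cache := &seenCache{list: map[string]struct{}{}}
	refs := []string{"registry.k8s.io/pause:3.9", "registry.k8s.io/pause:3.9"}
	for _, ref := range refs {
		if cache.markSeen(ref) {
			fmt.Println("already checked, skipping:", ref)
			continue
		}
		fmt.Println("would call signer.IsImageSigned for:", ref)
	}
}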

I propose to start with some more verbose logging and by tracing the timestamps between log messages in #640.

signer.IsImageSigned is indeed slow (~1s per image). I was able to trace the path as follows:

  1. IsImageSigned (k/release-sdk)
  2. SignedEntity (k/release-sdk)
  3. SignedEntity (cosign)
  4. remoteGet (cosign)
  5. Get (go-containerregistry)
  6. get (go-containerregistry)
  7. makeFetcher (go-containerregistry)
  8. makeFetcher (go-containerregistry)
  9. NewWithContext (go-containerregistry)

That's where it gets interesting:

  • The ping definitely takes some time
  • Another time consumer is the bt.refresh() call
Test case code
package main

import (
	"sync"
	"time"

	"github.com/sirupsen/logrus"
	"sigs.k8s.io/release-sdk/sign"
)

type Hook struct {
	lastTime time.Time
	mu       sync.RWMutex
}

func NewHook() *Hook {
	return &Hook{
		lastTime: time.Now(),
		mu:       sync.RWMutex{},
	}
}

func (h *Hook) Fire(e *logrus.Entry) error {
	h.mu.Lock()
	e.Data["diff"] = e.Time.Sub(h.lastTime).Round(time.Millisecond)
	h.lastTime = e.Time
	h.mu.Unlock()
	return nil
}

func (h *Hook) Levels() []logrus.Level {
	return logrus.AllLevels
}

func main() {
	logrus.SetFormatter(&logrus.TextFormatter{
		DisableTimestamp: false,
		FullTimestamp:    true,
		TimestampFormat:  "15:04:05.000",
	})
	logrus.AddHook(NewHook())

	signer := sign.New(sign.Default())

	const img = "docker.io/ubuntu:22.04"

	logrus.Info("Check if image is signed")
	signed, err := signer.IsImageSigned(img)
	if err != nil {
		panic(err)
	}
	logrus.Infof("Is signed: %v", signed)
}

I added some traces around that logic in NewWithContext in the same way as introduced in #640, with the output:

INFO[11:53:37.348] Check if image is signed                      diff=0s
INFO[11:53:37.349] NewWithContext start                          diff=0s
INFO[11:53:37.349] ping start                                    diff=0s
INFO[11:53:37.810] ping end                                      diff=462ms
INFO[11:53:37.810] bt.refresh start                              diff=0s
INFO[11:53:38.283] bt.refresh end                                diff=473ms
INFO[11:53:38.283] NewWithContext end                            diff=0s
INFO[11:53:38.450] NewWithContext start                          diff=167ms
INFO[11:53:38.450] ping start                                    diff=0s
INFO[11:53:38.564] ping end                                      diff=113ms
INFO[11:53:38.564] bt.refresh start                              diff=0s
INFO[11:53:38.704] bt.refresh end                                diff=140ms
INFO[11:53:38.704] NewWithContext end                            diff=0s
INFO[11:53:38.839] Is signed: false                              diff=135ms

The first call to NewWithContext seems to be the critical path. I created a tracking issue in go-containerregistry about that topic: google/go-containerregistry#1466

It makes sense that the first call to set up the transport would need to negotiate auth with the specific registry.

On mobile: are we able to supply and reuse transports on our end? We should probably set up one per worker goroutine and reuse it.
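For context, go-containerregistry's remote package accepts a caller-supplied transport, so at minimum the TCP/TLS setup can be shared across lookups; whether that alone avoids the repeated ping/token negotiation is what google/go-containerregistry#1466 is about. A rough sketch (the references and transport settings are illustrative):

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
	// One transport with keep-alives, shared by every lookup (or one per
	// worker goroutine), so connections are reused instead of re-dialed.
	sharedTransport := &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}

	for _, r := range []string{"registry.k8s.io/pause:3.9", "registry.k8s.io/pause:3.8"} {
		ref, err := name.ParseReference(r)
		if err != nil {
			panic(err)
		}
		desc, err := remote.Head(ref,
			remote.WithTransport(sharedTransport),
			remote.WithAuthFromKeychain(authn.DefaultKeychain),
		)
		if err != nil {
			panic(err)
		}
		fmt.Println(r, desc.Digest)
	}
}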

Working on a faster version of image signature verification in kubernetes-sigs/release-sdk#123

We also need a better release-sdk API for using VerifyImage:

// Verify the signatures in the edges we are looking at
t := throttler.New(maxParallelVerifications, len(signedEdges))
for e := range signedEdges {
	go func(edge reg.PromotionEdge) {
		logrus.Infof("Verifying signatures of image %s", edge.SrcReference())
		// Check the staged image signatures
		if _, err := signer.VerifyImage(edge.SrcReference()); err != nil {
			t.Done(fmt.Errorf(
				"verifying signatures of image %s: %w", edge.SrcReference(), err,
			))
			return
		}
		logrus.Infof("Signatures for ref %s verified", edge.SrcReference())
		t.Done(nil)
	}(e)

	if t.Throttle() > 0 {
		break
	}
}

if err := t.Err(); err != nil {
	return nil, err
}

The method again checks whether the image is signed (not required in our case) and does not reuse the transport.

Edit: Some ideas are now in kubernetes-sigs/release-sdk#124

I think one of the recent changes increased the number of failed runs.

https://prow.k8s.io/?job=*image-promo

We have failures like this:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1583138923981312000

time="17:17:43.175" level=fatal msg="run cip run: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fe2e-test-images%2Fvolume%2Frbd%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=97ms

IIRC, per Jon, the quota is pretty reasonable, so 🤔 (I'm forgetting the hard number; I asked).

Current quota consumption for Artifact Registry

https://cloud.google.com/artifact-registry/quotas

60000 requests per minute in each region or multi-region.

That's 1000 qps per region ... what are we doing that is hitting > 1000 qps in a region?

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo

besides the sample above:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582921237355565056

time="03:02:13.916" level=fatal msg="run cip run: filtering edges: filtering promotion edges: reading registries: getting tag list: Get "https://us-central1-docker.pkg.dev/v2/\": read tcp 10.4.2.155:33274->142.250.152.82:443: read: connection reset by peer" diff=2ms

We probably just need to tolerate connection blips in that case; see the retry sketch after these examples.

yesterday:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582781607704530944

level=fatal msg="run cip run: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fexperimental%2Fconformance-ppc64le%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'."

before that, last tuesday:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582405078290010112

level=fatal msg="run cip run: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fautoscaling%2Fvpa-recommender-s390x%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'."

us-east1 again
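Coming back to the connection-reset failure above: tolerating blips would mean wrapping the registry reads in a small bounded retry. A minimal sketch (the helper and backoff values are illustrative; go-containerregistry also ships retry options we could lean on instead):

package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// retryTransient retries fn when the error looks like a transient network
// blip (e.g. connection reset by peer), with simple exponential backoff.
// The error classification and limits here are illustrative only.
func retryTransient(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		var netErr net.Error
		if !errors.As(err, &netErr) {
			return err // not obviously transient, fail fast
		}
		time.Sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryTransient(3, 500*time.Millisecond, func() error {
		calls++
		if calls < 2 {
			// Simulate a transient read failure on the first attempt.
			return &net.OpError{Op: "read", Err: errors.New("connection reset by peer")}
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}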

The promoter will look for every image, in every mirror, to see how the target registry compares to the source. I think that is it: we have a ton of images, and the new mirrors multiplied those lookups about 9x, which is why the logs are now humongous too 😓

This quota is per region though, so we have to be exceeding 1000 qps in one region

registry.k8s.io users are consuming a non-zero amount, and that will increase with time, but in us-east1 it's currently only ~10 qps even reaching registry.k8s.io, and not all of that will reach AR.

Yeah, I assume we contact the registry multiple times per image, which should clearly be improved. But for checking whether a signature exists, for example, we already need two requests: one for the digest of the main image and one for the signature digest.
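To make those two requests concrete: cosign stores the signature in the same repository under a tag derived from the image digest (sha256-<hex>.sig), so a signature check first resolves the image digest and then probes that tag. A rough sketch using go-containerregistry's crane package (the reference is illustrative, and the error handling is simplified):

package main

import (
	"fmt"
	"strings"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	img := "registry.k8s.io/pause:3.9" // illustrative reference

	// Request 1: resolve the digest of the image itself.
	digest, err := crane.Digest(img)
	if err != nil {
		panic(err)
	}

	// Cosign's default naming: the signature lives in the same repository
	// under the tag sha256-<digest hex>.sig.
	repo := img[:strings.LastIndex(img, ":")]
	sigTag := fmt.Sprintf("%s:%s.sig", repo, strings.Replace(digest, ":", "-", 1))

	// Request 2: probe the signature tag; a "not found" error means unsigned.
	if _, err := crane.Head(sigTag); err != nil {
		fmt.Println("no signature found:", err)
		return
	}
	fmt.Println("signature tag exists:", sigTag)
}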

Some enhancements regarding signature verification have been merged in release-sdk and promo-tools. I still have some ideas for how to improve the signing part.

Amazing, I will kick off a canary run later today and we can see it in action. Thank you Sascha, see you soon! 🎉 😄

I think we can close this now, especially since we have #662. There could be more room for improvement, like reducing the number of registries.

We should still check the effective resource utilization and either pursue further tuning or reduce the resources configured for the job, since we request a whole node but were barely using it.

Good, can you provide us another set of data for, let's say, the patch promotions next week?

Can get a current sample as soon as I'm back at my desk, and thank you for working on this

Sorry .... today did not turn out as expected, I should have known.
So anyhow:

https://prow.k8s.io/?job=*image-promo

=> current run =>
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1588044493649612800

=> prowjob yaml link =>
status.pod_name in the prowjob gives us 9fb02c3f-5b3a-11ed-a031-bac5f1dea348

=> let's go check cloud console and find that pod in the k8s infra cluster =>
https://console.cloud.google.com/kubernetes/list/overview?mods=logs_tg_staging&project=k8s-infra-prow-build-trusted

=> workloads tab, filter by 9fb02c3f-5b3a-11ed-a031-bac5f1dea348 (could instead auth to the cluster and get the pod with kubectl) =>
https://console.cloud.google.com/kubernetes/pod/us-central1/prow-build-trusted/test-pods/9fb02c3f-5b3a-11ed-a031-bac5f1dea348/details?mods=logs_tg_staging&project=k8s-infra-prow-build-trusted&pageState=(%22savedViews%22:(%22i%22:%22073c6a714dc44e0d94ce168486658f14%22,%22c%22:%5B%5D,%22n%22:%5B%5D))

=> Node gke-prow-build-trust-trusted-pool1-20-4044b2cc-vqho, let's go find the VM by that name, paste 20-4044b2cc-vqho in the search box up top, click on the VM result =>
https://console.cloud.google.com/compute/instancesDetail/zones/us-central1-b/instances/gke-prow-build-trust-trusted-pool1-20-4044b2cc-vqho?q=search&referrer=search&project=k8s-infra-prow-build-trusted&mods=logs_tg_staging

Now if we look at the observability tab:
(Screenshot: VM observability metrics, 2022-11-02)

We can take this shortcut because we know there aren't other workloads on this machine besides the system agents, given the job / pod resource requests.

It looks like we still don't use a lot, so we should probably consider dialing back the changes I made to request lots of resources for this pod.

In the meantime we're getting quick push-after-promo-PR turnaround and quick promo presubmit testing thanks to @saschagrunert's changes to leverage the CI commit details, so I don't think it's pressing to get a super fast 8-core full re-sync yet.

(Note that the spike is around when the job started; it then settles down to around 4% CPU and 0.66 MiB/s outbound traffic.)
(Also note that the CPU utilization % metric is relative to the total vCPUs on the VM, not to one core.)

@BenTheElder Agreed, so would it be enough to change the resource requests/limits to something lower?

https://github.com/kubernetes/test-infra/blob/a2710ee42153299f3f77313f83f94d6398c8550f/config/jobs/kubernetes/sig-k8s-infra/trusted/releng/releng-trusted.yaml#L167-L174

What about the memory utilization?

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.