ovh / cds

Enterprise-Grade Continuous Delivery & DevOps Automation Open Source Platform

Home Page:https://ovh.github.io/cds/

Hatchery accidentally deletes config secret of worker Pod (Kubernetes)

pgillich opened this issue · comments

Root Cause

A cleanup background process deletes Secrets that are not referenced by any Pod. This is a bad design, because such Secrets should be deleted by the function that deletes the Pod. The Pod --> Secret matching is done through the CDS_WORKER_NAME label (set on both the Secret and the Pod).

Sometimes the cleanup background process runs at the same time as the worker creator function. The worker creator function creates the Secret and the Pod, but the cleanup background process sees only the Secret, not the Pod, so it deletes the Secret.

In an HA control plane setup (multiple etcd members), a write request is processed by the etcd leader, but a read request can be served by any etcd member (leader or follower). A follower can lag behind the leader, so for a while after a Pod is created, a Pod list requested by another Kubernetes client session may not include it. See:
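The race can be shown in miniature. The Go sketch below is not the hatchery code: `Secret`, `Pod`, `orphanedSecrets`, and `staleListExample` are hypothetical stand-ins, and only the label name CDS_WORKER_NAME comes from this issue. It illustrates that a Pod list missing the freshly created Pod makes its Secret look orphaned.

```go
package main

import "fmt"

// workerNameLabel is the label that ties a Secret to its Pod (named in the issue).
const workerNameLabel = "CDS_WORKER_NAME"

// Secret and Pod are simplified, hypothetical stand-ins for the Kubernetes objects.
type Secret struct {
	Name   string
	Labels map[string]string
}

type Pod struct {
	Name   string
	Labels map[string]string
}

// orphanedSecrets returns the names of Secrets whose worker-name label
// matches no Pod in the given list.
func orphanedSecrets(secrets []Secret, pods []Pod) []string {
	workers := make(map[string]bool, len(pods))
	for _, p := range pods {
		workers[p.Labels[workerNameLabel]] = true
	}
	var orphans []string
	for _, s := range secrets {
		if !workers[s.Labels[workerNameLabel]] {
			orphans = append(orphans, s.Name)
		}
	}
	return orphans
}

// staleListExample models the race: the Secret list is current, but the Pod
// list in the first call does not yet contain the Pod that was just created.
func staleListExample() (withStalePods, withFreshPods []string) {
	labels := map[string]string{workerNameLabel: "worker-a"}
	secrets := []Secret{{Name: "cds-worker-config-worker-a", Labels: labels}}
	fresh := []Pod{{Name: "worker-a", Labels: labels}}
	return orphanedSecrets(secrets, nil), orphanedSecrets(secrets, fresh)
}

func main() {
	stale, fresh := staleListExample()
	fmt.Println("orphans with stale Pod list:", stale)
	fmt.Println("orphans with fresh Pod list:", fresh)
}
```

With the stale list the Secret is reported as orphaned even though its Pod exists, which is the situation described above.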
https://etcd.io/docs/v3.5/faq/#do-clients-have-to-send-requests-to-the-etcd-leader
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/
etcd-io/etcd#14501
https://raft.github.io/raft.pdf

The cleanup background process: HatcheryKubernetes.routines --> HatcheryKubernetes.deleteSecrets

The worker creator function: HatcheryKubernetes.SpawnWorker

Checks Before Code Change

Check that the CDS_WORKER_NAME label (in Secret and Pod) is used correctly in a running system.

Workaround

Decrease the replica count to 1 for the hatchery and etcd (if possible). This only decreases the probability of the issue.

Solution Alternatives

A) Changing the order of requests

The worker creator function creates the Pod first and the Secrets last. Kubernetes can wait for missing Secrets for a while.

The cleanup background process fetches the Secrets first and the Pods last.

This is not a 100% solution, because of etcd follower delay and Raft load sharing.

B) Only mark for deletion on the first run

On its first run, the cleanup background process only marks the Secret for deletion; the next run (10 seconds later) deletes it.

This is not a 100% solution if the etcd follower lags by more than 10 seconds.
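A minimal sketch of the mark-then-delete idea, in Go. The `pendingDeletes` type, its `sweep` method, and the example Secret names are hypothetical and not part of the hatchery code. It only illustrates the bookkeeping: a Secret is deleted only if it looked orphaned in two consecutive runs.

```go
package main

import "fmt"

// pendingDeletes remembers the Secrets that looked orphaned in the previous run.
type pendingDeletes map[string]bool

// sweep takes the Secrets that look orphaned in the current run and returns
// the ones to delete now: those already marked by the previous run.
// Secrets seen as orphaned for the first time are only marked.
func (p pendingDeletes) sweep(orphans []string) []string {
	var toDelete []string
	next := make(map[string]bool)
	for _, name := range orphans {
		if p[name] {
			toDelete = append(toDelete, name)
		} else {
			next[name] = true
		}
	}
	// Drop all old marks, then keep only the marks set in this run.
	for name := range p {
		delete(p, name)
	}
	for name := range next {
		p[name] = true
	}
	return toDelete
}

// markThenDeleteExample runs four cleanup passes and records what each deletes.
func markThenDeleteExample() [][]string {
	p := pendingDeletes{}
	runs := [][]string{
		{"secret-a"}, // Pod not visible yet: only marked
		{},           // Pod visible now: mark is dropped
		{"secret-b"}, // truly orphaned: marked
		{"secret-b"}, // still orphaned: deleted
	}
	var deleted [][]string
	for _, orphans := range runs {
		deleted = append(deleted, p.sweep(orphans))
	}
	return deleted
}

func main() {
	for i, d := range markThenDeleteExample() {
		fmt.Printf("run %d deletes %v\n", i+1, d)
	}
}
```

As noted above, this only narrows the window: a Pod list that stays stale across two runs would still lead to a wrong delete.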

C) Delete Secrets immediately after deleting the referring Pod

Delete the Secrets in the code that deletes the referring Pod, instead of in the cleanup background process.

Most probably this is a 100% solution.
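Alternative C could look roughly like the Go sketch below. It is an illustration under assumptions, not the hatchery implementation: the `cluster` interface, `deleteWorker`, `fakeCluster`, and `errNotFound` are hypothetical, and the `cds-worker-config-` prefix is inferred from the Secret name visible in the logs in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// secretPrefix is inferred from the Secret name in the logs (assumption).
const secretPrefix = "cds-worker-config-"

var errNotFound = errors.New("not found")

// cluster is a hypothetical, minimal interface over the Kubernetes API.
type cluster interface {
	DeletePod(name string) error
	DeleteSecret(name string) error
}

// deleteWorker deletes the Pod and then its config Secret in the same call
// path, so no background scan is needed to find the Secret. "Not found"
// is tolerated for both objects to keep the operation idempotent.
func deleteWorker(c cluster, workerName string) error {
	if err := c.DeletePod(workerName); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete pod %q: %w", workerName, err)
	}
	secretName := secretPrefix + workerName
	if err := c.DeleteSecret(secretName); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete secret %q: %w", secretName, err)
	}
	return nil
}

// fakeCluster is an in-memory implementation used only for the demonstration.
type fakeCluster struct {
	pods    map[string]bool
	secrets map[string]bool
}

func (f *fakeCluster) DeletePod(name string) error {
	if !f.pods[name] {
		return errNotFound
	}
	delete(f.pods, name)
	return nil
}

func (f *fakeCluster) DeleteSecret(name string) error {
	if !f.secrets[name] {
		return errNotFound
	}
	delete(f.secrets, name)
	return nil
}

// deleteWorkerExample deletes one of two workers and reports what is left.
func deleteWorkerExample() (podsLeft, secretsLeft int, err error) {
	c := &fakeCluster{
		pods:    map[string]bool{"worker-a": true, "worker-b": true},
		secrets: map[string]bool{secretPrefix + "worker-a": true, secretPrefix + "worker-b": true},
	}
	err = deleteWorker(c, "worker-a")
	return len(c.pods), len(c.secrets), err
}

func main() {
	pods, secrets, err := deleteWorkerExample()
	fmt.Println("pods left:", pods, "secrets left:", secrets, "err:", err)
}
```

Because the Secret name is derived from the worker name being deleted, the decision never depends on a Pod list that might be stale.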

More information

Logs

cds-hatchery-kubernetes-1.log:

2023-09-27 14:38:09 [DEBUG] hatchery> spawnWorkerForJob> 310 action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=120

2023-09-27 14:38:09 [DEBUG] hatchery> spawnWorkerForJob> 1695825489 - send book job 310 action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=160

2023-09-27 14:38:10 [INFO] starting worker "ncd-group-ncd-custom-worker-keen-and-fervent-hugle" from model "NCD-Group/NCD-custom-worker" (project: NCD, workflow: DeployOrUpgradeRB , job:Check-Existing-NS, jobID:310) action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=192

2023-09-27 14:38:10 [INFO] creating pod ncd-group-ncd-custom-worker-keen-and-fervent-hugle action_metadata_job_id=310 auth_worker_name=ncd-group-ncd-custom-worker-keen-and-fervent-hugle caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*kubernetesClient).PodCreate goroutine=mainRoutine k8s_ns=kkurti-cds-fp5 k8s_pod=ncd-group-ncd-custom-worker-keen-and-fervent-hugle service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes_client.go source_line=128

2023-09-27 14:38:10 [DEBUG] hatchery> kubernetes> SpawnWorker> ncd-group-ncd-custom-worker-keen-and-fervent-hugle > Pod created action_metadata_job_id=310 auth_worker_name=ncd-group-ncd-custom-worker-keen-and-fervent-hugle caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*HatcheryKubernetes).SpawnWorker goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes.go source_line=409

2023-09-27 14:38:10 [INFO] hatchery> spawnWorkerForJob> 310 (0.716 seconds elapsed) action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=229

cds-hatchery-kubernetes-0.log:

2023-09-27 14:38:11 [INFO] listing pod in namespace kkurti-cds-fp5 caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*kubernetesClient).PodList goroutine=deleteSecrets k8s_ns=kkurti-cds-fp5 source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes_client.go source_line=143

2023-09-27 14:38:11 [INFO] listing pod in namespace kkurti-cds-fp5 caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*kubernetesClient).PodList goroutine=killAwolWorker k8s_ns=kkurti-cds-fp5 source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes_client.go source_line=143

2023-09-27 14:38:11 [DEBUG] delete secret "cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle" caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*HatcheryKubernetes).deleteSecrets goroutine=deleteSecrets source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/secrets.go source_line=51

Kubernetes events

keen-and-fervent-hugle-1.zip

Notable events:

| Note | stageTimestamp | requestReceivedTimestamp | verb | objectRef.resource | objectRef.name | user.username | user.extra.authentication.kubernetes.io/pod-name | sourceIPs | responseStatus.code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Delete previous Secret, if any | 2023-09-27T14:38:10.089662Z | 2023-09-27T14:38:10.084513Z | delete | secrets | cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-1 | 10.255.20.111 | 404 |
| Create Secret | 2023-09-27T14:38:10.104058Z | 2023-09-27T14:38:10.091463Z | create | secrets | cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-1 | 10.255.20.111 | 201 |
| Create Pod | 2023-09-27T14:38:10.165973Z | 2023-09-27T14:38:10.108109Z | create | pods | ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-1 | 10.255.20.111 | 201 |
| Schedule Pod create | 2023-09-27T14:38:10.185702Z | 2023-09-27T14:38:10.174581Z | create | pods | ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:kube-scheduler | | 172.16.119.96 | 201 |
| Event about Pod create schedule | 2023-09-27T14:38:10.198886Z | 2023-09-27T14:38:10.188999Z | create | events | ncd-group-ncd-custom-worker-keen-and-fervent-hugle.1788c84c67adcf51 | system:kube-scheduler | | 172.16.119.96 | 201 |
| List Pods | 2023-09-27T14:38:11.675460Z | 2023-09-27T14:38:11.658498Z | list | pods | | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200 |
| List Pods | 2023-09-27T14:38:11.685867Z | 2023-09-27T14:38:11.658055Z | list | pods | | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200 |
| List Secrets | 2023-09-27T14:38:11.695895Z | 2023-09-27T14:38:11.687951Z | list | secrets | | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200 |
| Delete Secret | 2023-09-27T14:38:11.703874Z | 2023-09-27T14:38:11.697844Z | delete | secrets | cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200 |

Thank you @pgillich for this detailed issue. You're right, killAwolWorkers probably needs to delete the secrets synchronously. We'll look at it. The workaround probably helps, but the issue can still happen.

For the record, we have implemented a rather minimalistic workaround in the meantime with a configurable grace period before apparently orphaned secrets are deleted. Unfortunately, the team that would be able to verify that it actually solves the issue has not had the time to check it for over a week now.

It (supposedly) works by filtering out secrets that are more recent than the configured grace period, like this:

engine/hatchery/kubernetes/secrets.go:
secrets, err := h.kubeClient.SecretList(ctx, h.Config.Namespace, metav1.ListOptions{FieldSelector: fmt.Sprintf("metadata.creationTimestamp<=%s", time.Now().Add(-h.Config.SecretGracePeriod).Format(time.RFC3339)), LabelSelector: LABEL_HATCHERY_NAME})

engine/hatchery/kubernetes/types.go:
SecretGracePeriod time.Duration `mapstructure:"secretGracePeriod" toml:"secretGracePeriod" default:"10m" commented:"false" comment:"Secrets will not be cleaned up even if no worker pods refer to them if they are not at least this old" json:"secretGracePeriod"`

Update: After finally testing the above code we found out that FieldSelector does not support the creationTimestamp field, so we switched to filtering the secrets in code.
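Filtering in code might look like the Go sketch below. This is an assumption about the general shape, not the code that was actually merged: `Secret` is a simplified, hypothetical stand-in for the Kubernetes object, and `olderThan` and `gracePeriodExample` are illustrative names. The 10-minute value mirrors the `default:"10m"` shown above.

```go
package main

import (
	"fmt"
	"time"
)

// Secret is a simplified, hypothetical stand-in for the Kubernetes object.
type Secret struct {
	Name              string
	CreationTimestamp time.Time
}

// olderThan keeps only the Secrets created at least gracePeriod before now.
// Younger Secrets are skipped so that a Pod list that lags behind cannot
// cause a freshly created Secret to be deleted.
func olderThan(secrets []Secret, gracePeriod time.Duration, now time.Time) []Secret {
	cutoff := now.Add(-gracePeriod)
	var old []Secret
	for _, s := range secrets {
		if !s.CreationTimestamp.After(cutoff) {
			old = append(old, s)
		}
	}
	return old
}

// gracePeriodExample applies a 10-minute grace period to two Secrets.
func gracePeriodExample() []string {
	now := time.Date(2023, 9, 27, 14, 38, 11, 0, time.UTC)
	secrets := []Secret{
		{Name: "just-created", CreationTimestamp: now.Add(-1 * time.Second)},
		{Name: "long-orphaned", CreationTimestamp: now.Add(-30 * time.Minute)},
	}
	var names []string
	for _, s := range olderThan(secrets, 10*time.Minute, now) {
		names = append(names, s.Name)
	}
	return names
}

func main() {
	fmt.Println("candidates for cleanup:", gracePeriodExample())
}
```

Only Secrets that pass this age filter would then go through the usual check for a matching Pod.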

Will be available in 0.54.0.