Hatchery accidentally deletes config secret of worker Pod (Kubernetes)
pgillich opened this issue
Root Cause
There is a background cleanup process that deletes every Secret not referenced by any Pod. This is a fragile design: such Secrets should instead be deleted by the function that deletes the Pod. The Pod --> Secret matching is configured via the CDS_WORKER_NAME label (set on both the Secret and the Pod).
Sometimes, when the background cleanup process runs at the same time as the worker creator function, the worker creator function creates the Secret and the Pod, but the cleanup process sees only the Secret and not yet the Pod, so it deletes the Secret.
In an HA control-plane setup (multiple etcd instances), a write request is always processed by the etcd leader, but a read request can be served by any etcd instance (leader or follower). A follower can lag behind the leader, so for a while a Pod just created through another Kubernetes client session can be missing from a Pod list response, see:
https://etcd.io/docs/v3.5/faq/#do-clients-have-to-send-requests-to-the-etcd-leader
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/
etcd-io/etcd#14501
https://raft.github.io/raft.pdf
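To illustrate the root cause, here is a minimal sketch of the orphan-detection logic described above, using hypothetical plain structs instead of the real client-go Pod/Secret types: a Secret is treated as orphaned when no Pod carries the same CDS_WORKER_NAME label value, so a stale Pod list makes a freshly created Secret look orphaned.

```go
package main

import "fmt"

// object is a stand-in for a Kubernetes Pod or Secret (hypothetical type;
// the real hatchery works on client-go objects).
type object struct {
	Name   string
	Labels map[string]string
}

const labelWorkerName = "CDS_WORKER_NAME"

// orphanSecrets returns the names of secrets whose CDS_WORKER_NAME label
// matches no pod in the given pod list.
func orphanSecrets(secrets, pods []object) []string {
	podWorkers := map[string]bool{}
	for _, p := range pods {
		if w, ok := p.Labels[labelWorkerName]; ok {
			podWorkers[w] = true
		}
	}
	var orphans []string
	for _, s := range secrets {
		if w, ok := s.Labels[labelWorkerName]; ok && !podWorkers[w] {
			orphans = append(orphans, s.Name)
		}
	}
	return orphans
}

func main() {
	secrets := []object{{
		Name:   "cds-worker-config-a",
		Labels: map[string]string{labelWorkerName: "a"},
	}}
	// The race: the pod for worker "a" already exists, but a stale read
	// from an etcd follower misses it, so the pod list arrives empty and
	// the secret is (wrongly) classified as orphaned.
	fmt.Println(orphanSecrets(secrets, nil))
}
```

The deletion itself is correct given its inputs; the bug is that the Pod list input can be stale relative to the Secret list.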
The cleanup background process: HatcheryKubernetes.routines --> HatcheryKubernetes.deleteSecrets
The worker creator function: HatcheryKubernetes.SpawnWorker
Checks Before Code Change
Check that the CDS_WORKER_NAME label (on Secret and Pod) is used correctly in a running system.
Workaround
Decrease the replica count of the hatchery and of etcd (if possible) to 1. This only reduces the probability of the race; it does not eliminate it.
Solution Alternatives
A) Changing the order of requests
The worker creator function creates the Pod first and the Secrets last; Kubernetes retries for a while if a referenced Secret is missing.
The cleanup background process fetches the Secrets first and the Pods last.
This is not a 100% solution, because of etcd follower delay and Raft read load sharing.
B) Mark only for deletion on first sight
The cleanup background process first marks the Secret for deletion, and only the next run (after 10 s) actually deletes it.
This is not a 100% solution if the etcd follower lags by more than 10 seconds.
C) Delete Secrets immediately after deleting the referring Pod
Delete the Secrets from the code that deletes the referring Pod, instead of from the cleanup background process.
This is most likely a complete solution.
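Alternative C couples the two deletions in one code path. A minimal sketch, with hypothetical callbacks standing in for the real client-go delete calls (the `cds-worker-config-` name prefix is taken from the logs below):

```go
package main

import "fmt"

// deleteWorker removes the worker pod and then the secret that refers to
// it, so no separate cleanup pass has to guess which secrets are orphaned.
// deletePod and deleteSecret are injected here for testability; in the
// real hatchery these would be Kubernetes API calls.
func deleteWorker(workerName string, deletePod, deleteSecret func(name string) error) error {
	if err := deletePod(workerName); err != nil {
		return fmt.Errorf("delete pod %s: %w", workerName, err)
	}
	// Secret name convention observed in the logs: cds-worker-config-<worker>.
	return deleteSecret("cds-worker-config-" + workerName)
}

func main() {
	var deleted []string
	record := func(name string) error { deleted = append(deleted, name); return nil }
	_ = deleteWorker("keen-and-fervent-hugle", record, record)
	fmt.Println(deleted)
}
```

Because the Secret is deleted only together with its Pod, a stale Pod list can no longer cause a Secret of a live worker to be removed; a small cleanup pass would still be needed for Secrets whose Pod was never created.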
More information
Logs
cds-hatchery-kubernetes-1.log:
2023-09-27 14:38:09 [DEBUG] hatchery> spawnWorkerForJob> 310 action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=120
2023-09-27 14:38:09 [DEBUG] hatchery> spawnWorkerForJob> 1695825489 - send book job 310 action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=160
2023-09-27 14:38:10 [INFO] starting worker "ncd-group-ncd-custom-worker-keen-and-fervent-hugle" from model "NCD-Group/NCD-custom-worker" (project: NCD, workflow: DeployOrUpgradeRB , job:Check-Existing-NS, jobID:310) action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=192
2023-09-27 14:38:10 [INFO] creating pod ncd-group-ncd-custom-worker-keen-and-fervent-hugle action_metadata_job_id=310 auth_worker_name=ncd-group-ncd-custom-worker-keen-and-fervent-hugle caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*kubernetesClient).PodCreate goroutine=mainRoutine k8s_ns=kkurti-cds-fp5 k8s_pod=ncd-group-ncd-custom-worker-keen-and-fervent-hugle service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes_client.go source_line=128
2023-09-27 14:38:10 [DEBUG] hatchery> kubernetes> SpawnWorker> ncd-group-ncd-custom-worker-keen-and-fervent-hugle > Pod created action_metadata_job_id=310 auth_worker_name=ncd-group-ncd-custom-worker-keen-and-fervent-hugle caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*HatcheryKubernetes).SpawnWorker goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes.go source_line=409
2023-09-27 14:38:10 [INFO] hatchery> spawnWorkerForJob> 310 (0.716 seconds elapsed) action_metadata_job_id=310 caller=github.com/ovh/cds/sdk/hatchery.spawnWorkerForJob goroutine=mainRoutine service=hatchery:kubernetes source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/sdk/hatchery/starter.go source_line=229
cds-hatchery-kubernetes-0.log:
2023-09-27 14:38:11 [INFO] listing pod in namespace kkurti-cds-fp5 caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*kubernetesClient).PodList goroutine=deleteSecrets k8s_ns=kkurti-cds-fp5 source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes_client.go source_line=143
2023-09-27 14:38:11 [INFO] listing pod in namespace kkurti-cds-fp5 caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*kubernetesClient).PodList goroutine=killAwolWorker k8s_ns=kkurti-cds-fp5 source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/kubernetes_client.go source_line=143
2023-09-27 14:38:11 [DEBUG] delete secret "cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle" caller=github.com/ovh/cds/engine/hatchery/kubernetes.(*HatcheryKubernetes).deleteSecrets goroutine=deleteSecrets source_file=/home/jenkins/workspace/_NCD_CDS_Build-Github-CDS_master/engine/hatchery/kubernetes/secrets.go source_line=51
Kubernetes events
Notable events:
Note | stageTimestamp | requestReceivedTimestamp | verb | objectRef.resource | objectRef.name | user.username | user.extra.authentication.kubernetes.io/pod-name | sourceIPs | responseStatus.code
---|---|---|---|---|---|---|---|---|---
Delete previous Secret, if any | 2023-09-27T14:38:10.089662Z | 2023-09-27T14:38:10.084513Z | delete | secrets | cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-1 | 10.255.20.111 | 404
Create Secret | 2023-09-27T14:38:10.104058Z | 2023-09-27T14:38:10.091463Z | create | secrets | cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-1 | 10.255.20.111 | 201
Create Pod | 2023-09-27T14:38:10.165973Z | 2023-09-27T14:38:10.108109Z | create | pods | ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-1 | 10.255.20.111 | 201
Schedule Pod create | 2023-09-27T14:38:10.185702Z | 2023-09-27T14:38:10.174581Z | create | pods | ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:kube-scheduler | | 172.16.119.96 | 201
Event about Pod create schedule | 2023-09-27T14:38:10.198886Z | 2023-09-27T14:38:10.188999Z | create | events | ncd-group-ncd-custom-worker-keen-and-fervent-hugle.1788c84c67adcf51 | system:kube-scheduler | | 172.16.119.96 | 201
List Pods | 2023-09-27T14:38:11.675460Z | 2023-09-27T14:38:11.658498Z | list | pods | | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200
List Pods | 2023-09-27T14:38:11.685867Z | 2023-09-27T14:38:11.658055Z | list | pods | | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200
List Secrets | 2023-09-27T14:38:11.695895Z | 2023-09-27T14:38:11.687951Z | list | secrets | | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200
Delete Secret | 2023-09-27T14:38:11.703874Z | 2023-09-27T14:38:11.697844Z | delete | secrets | cds-worker-config-ncd-group-ncd-custom-worker-keen-and-fervent-hugle | system:serviceaccount:kkurti-cds-fp5:internal-kubectl | cds-hatchery-kubernetes-0 | 10.255.16.37 | 200
Thank you @pgillich for this detailed issue. You're right, killAwolWorkers probably needs to delete the secrets synchronously. We'll look into it; the workaround probably helps, but the problem can still happen.
For the record, we have implemented a rather minimalistic workaround in the meantime with a configurable grace period before apparently orphaned secrets are deleted. Unfortunately, the team that would be able to verify that it actually solves the issue has not had the time to check it for over a week now.
It (supposedly) works by filtering out secrets that are more recent than the configured grace period, like this:
engine/hatchery/kubernetes/secrets.go:

```go
secrets, err := h.kubeClient.SecretList(ctx, h.Config.Namespace, metav1.ListOptions{
	FieldSelector: fmt.Sprintf("metadata.creationTimestamp<=%s",
		time.Now().Add(-h.Config.SecretGracePeriod).Format(time.RFC3339)),
	LabelSelector: LABEL_HATCHERY_NAME,
})
```
engine/hatchery/kubernetes/types.go:

```go
SecretGracePeriod time.Duration `mapstructure:"secretGracePeriod" toml:"secretGracePeriod" default:"10m" commented:"false" comment:"Secrets will not be cleaned up even if no worker pods refer to them if they are not at least this old" json:"secretGracePeriod"`
```
Update: After finally testing the above code, we found out that FieldSelector does not support the creationTimestamp field, so we switched to filtering the secrets in code.
This will be available in 0.54.0.