enix / kube-image-keeper

kuik is a container image caching system for Kubernetes

"Sporadic" controller crash when Pods are in "Terminating" status

Nicolasgouze opened this issue · comments

@denniskern: Following our short conversation, can you please provide logs and as much info as possible regarding this issue?

On our side :

  • we cannot reproduce this issue (tested with the 1.6 & 1.7 releases)
  • @paullaffitte checked the code.

We want to be sure we are not missing anything...

May be related to #308

If this issue appeared with version 1.7.0, I think we can confirm that it's the same issue as #308 and thus resolved by #311. @denniskern could you confirm please?

Sorry for the late response - I was on leave.

Yes, exactly this happened. If you have a version out there where I can test #311, I will do it.

Hi @denniskern, the PR #311 (including a fix) was merged today and will be available in the next release. I will wait for your test before closing this ticket.

@Nicolasgouze just so we're not talking past each other - I need a release to test it :-)

@Nicolasgouze @paullaffitte Would you be able to release a beta version so we can test this more easily? I saw that you did that earlier already. It would help us get you the required information more quickly.

Hello @spr-mweber3, we'll release a beta version next Monday. Stay tuned!

I tested version v1.8.1-beta.1 and the controller still crashes, with this message:

```
2024-05-06T14:31:02.140Z ERROR setup problem running manager {"error": "Pod \"kube-prometheus-stack-admission-create-stq7n\" is invalid: spec: Forbidden: pod updates may not change fields other than spec.containers[*].image,spec.initContainers[*].image,spec.activeDeadlineSeconds,spec.tolerations (only additions to existing tolerations),spec.terminationGracePeriodSeconds (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
  	... // 6 identical fields
  	ActiveDeadlineSeconds: nil,
  	DNSPolicy:             "ClusterFirst",
- 	NodeSelector:          nil,
+ 	NodeSelector:          map[string]string{"workergroup": "wg1"},
  	ServiceAccountName:           "kube-prometheus-stack-admission",
  	AutomountServiceAccountToken: nil,
  	... // 22 identical fields
  }"}
```

The pod kube-prometheus-stack-admission-create-stq7n is in the Terminating state.

It looks like we have another problem here. But it is very surprising, because this error appears during the initialization step, as the log "setup problem running manager" suggests, and the only update that we do on pods during initialization is a no-op (p.Client.Patch(context.Background(), &pod, client.RawPatch(types.JSONPatchType, []byte("[]")))). The goal is to trigger the mutating webhook on all existing pods. And in this mutating webhook we only rewrite images and add annotations, which should not be an issue either.

What version of Kubernetes are you using please?

We are using 1.27

Sorry but I cannot reproduce your issue on a cluster in version 1.27. Is there anything specific in your setup? If you could produce a minimal reproducible example it would greatly help.

Hi @denniskern, do you have any further info to provide so that we can try to reproduce and finally correct the issue?
Thanks in advance !

Hi guys @paullaffitte @Nicolasgouze

I figured out that the problem has nothing to do with the state of the pod, but rather with a ClusterPolicy handled by Kyverno. In our case we have a ClusterPolicy that adds a nodeSelector to pods, and because of a timing problem the policy was put in place after the pod had spawned. So when kuik wants to rewrite the image location, the ClusterPolicy also wants to update the nodeSelector, and this is not allowed on a running pod, which leads to this error.
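To illustrate the kind of policy described (this is a hypothetical sketch, not the actual policy from the cluster; the policy and rule names are made up, and only the `workergroup: wg1` selector is taken from the error log above), a Kyverno ClusterPolicy mutating pods to add a nodeSelector could look like this:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-workergroup-nodeselector   # hypothetical name
spec:
  rules:
    - name: add-nodeselector           # hypothetical name
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            nodeSelector:
              workergroup: wg1         # value seen in the error log
```

If this policy lands after a pod is already running, any later pod UPDATE (such as kuik's image rewrite) gets the nodeSelector mutation applied too, and the API server rejects it because nodeSelector is immutable on existing pods.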

But something must have changed since version 1.7.0, because we don't see this behavior from kuik in version 1.6.0.

Since we fixed the policy it now works fine.

Thanks a lot for your support!

Hi @denniskern ,

Thanks a lot for the explanation provided !

It will not come shortly (we currently have other items under development), and it would not have given you 100% of the root cause in your scenario (because of the timing, and because "we don't know what we don't know"), but we are thinking about working on a "diagnosis tool" that would run on kuik startup to check that all cluster prerequisites are met before the kuik services actually start.