enix / kube-image-keeper

kuik is a container image caching system for Kubernetes

"Sporadic" controller crash when Pods are in "Terminating" status

Nicolasgouze opened this issue · comments

@denniskern: Following our short conversation, can you please provide logs and as much info as possible regarding this issue?

On our side :

  • we cannot reproduce this issue (tested with the 1.6 & 1.7 releases)
  • @paullaffitte checked the code.

We want to be sure we are not missing anything...

May be related to #308

If this issue appeared with version 1.7.0, I think we can confirm that it's the same issue as #308 and thus resolved by #311. @denniskern could you confirm please?

Sorry for the late response - I was on leave.

Yes, exactly this happened. If you have a version out there where I can test #311, I will do it.

Hi @denniskern, the PR #311 (including a fix) was merged today and will be available in the next release. I will wait for your test before closing this ticket.

@Nicolasgouze just so we're not talking past each other - I need a release to test it :-)

@Nicolasgouze @paullaffitte Would you be able to release a beta version so we can test this more easily? I saw that you did that earlier already. It would help us get you the required information more quickly.

Hello @spr-mweber3, we'll release a beta version next Monday. Stay tuned!

I tested version v1.8.1-beta.1 and the controller still crashes, with this message:

```
2024-05-06T14:31:02.140Z ERROR setup problem running manager {"error": "Pod \"kube-prometheus-stack-admission-create-stq7n\" is invalid: spec: Forbidden: pod updates may not change fields other than spec.containers[*].image,spec.initContainers[*].image,spec.activeDeadlineSeconds,spec.tolerations (only additions to existing tolerations),spec.terminationGracePeriodSeconds (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
  	... // 6 identical fields
  	ActiveDeadlineSeconds: nil,
  	DNSPolicy:             "ClusterFirst",
- 	NodeSelector:          nil,
+ 	NodeSelector:          map[string]string{"workergroup": "wg1"},
  	ServiceAccountName:           "kube-prometheus-stack-admission",
  	AutomountServiceAccountToken: nil,
  	... // 22 identical fields
  }"}
```

The pod kube-prometheus-stack-admission-create-stq7n is in the Terminating state.

It looks like we have another problem here. But it is very surprising, because this error appears during the initialization step, as the log "setup problem running manager" suggests, and the only update that we do on pods during initialization is a no-op (p.Client.Patch(context.Background(), &pod, client.RawPatch(types.JSONPatchType, []byte("[]")))). The goal is to trigger the mutating webhook on all existing pods. And in this mutating webhook we only rewrite images and add annotations, which should not be an issue either.

What version of Kubernetes are you using please?

We are using 1.27

Sorry but I cannot reproduce your issue on a cluster in version 1.27. Is there anything specific in your setup? If you could produce a minimal reproducible example it would greatly help.

Hi @denniskern, do you have any further info to provide so that we can try to reproduce and finally correct the issue?
Thanks in advance !

Hi guys @paullaffitte @Nicolasgouze

I figured out that the problem has nothing to do with the state of the pod, but rather with a ClusterPolicy handled by Kyverno. In our case we have a ClusterPolicy that adds a nodeSelector to pods, and because of a timing problem the policy was put in place after the pod had spawned. So when kuik wants to rewrite the image location, the ClusterPolicy also wants to update the nodeSelector, and this is not allowed on a running pod, which leads to this error.
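To illustrate the kind of policy described (this is a hypothetical sketch, not the actual policy from the cluster; the policy and rule names are made up, and only the `workergroup: wg1` selector is taken from the error log above), a Kyverno ClusterPolicy mutating pods to add a nodeSelector could look like this:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-workergroup-nodeselector   # hypothetical name
spec:
  rules:
    - name: add-nodeselector           # hypothetical name
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            nodeSelector:
              workergroup: wg1         # value seen in the error log
```

If this policy lands after a pod is already running, any later pod UPDATE (such as kuik's image rewrite) gets the nodeSelector mutation applied too, and the API server rejects it because nodeSelector is immutable on existing pods.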

But something must have changed since version 1.7.0, because we don't see this behavior from kuik in version 1.6.0.

Since we fixed the policy it now works fine.

Thanks a lot for your support!

Hi @denniskern ,

Thanks a lot for the explanation provided !

It will not come shortly (we currently have other items under development), and it would not have given you 100% of the root cause in your scenario (because of the timing, and because "we don't know what we don't know"), but we are thinking about working on a "diagnosis tool" that would run on kuik startup to check that all cluster prerequisites are met before the kuik services actually start.