kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Home Page: https://kubernetes.io


Potentially lengthy I/O by kubelet during pod startup needs to be visible to the user

roy-work opened this issue

What happened?

For example, we had a pod which had:

 securityContext:
   fsGroup: 1001
   runAsGroup: 1001
   runAsUser: 1001

This, unbeknownst to us for quite some time, led to the pod taking hours to start, during which it appeared to be simply stuck in Pending.

Now, in this particular example, the "root" cause is that the fsGroup there, combined with fsGroupChangePolicy: Always (which is the default, and unspecified by us here), causes the kubelet, prior to starting the pod's containers, to recursively chown the Pod's volumes to the GID in fsGroup. This is a potentially enormous amount of I/O, as in our case, where it took hours. (The volume in question had 2.1M files on it!)
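For reference, there is a documented way to avoid the recursive walk on every start: setting fsGroupChangePolicy: OnRootMismatch tells the kubelet to skip the chown/chmod pass when the ownership and permissions of the volume's root already match the expected values. A sketch of the securityContext we could have used, had we known the knob existed:

 securityContext:
   fsGroup: 1001
   runAsGroup: 1001
   runAsUser: 1001
   # Only walk the volume when its root does not already have the expected
   # ownership/permissions, instead of a full recursive chown on every start.
   fsGroupChangePolicy: "OnRootMismatch"

Of course, that only helps once you know the recursive chown is what is eating the time, which is exactly the visibility problem this issue is about.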

What did you expect to happen?

A clear way to tell why a pod is not transitioning out of Pending in a timely manner. A simple event in the describe pod output for the pod would have done it.
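To make the ask concrete, something along these lines would have pointed us at the cause immediately. (This is purely illustrative: no such event exists in the kubelet today, and the reason and message below are made up.)

Events:
  Type     Reason                  Age   From     Message
  ----     ------                  ----  ----     -------
  Warning  SettingVolumeOwnership  3m    kubelet  Recursively changing ownership/permissions of volume "data" to fsGroup 1001 (~2.1M files); this may take a long time

Even a single event when the operation starts, and perhaps another when it finishes, would turn hours of guessing into a one-line explanation.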

How can we reproduce it (as minimally and precisely as possible)?

Have a volume w/ ~2M files, and an fsGroup.
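A minimal sketch of a reproducer, assuming a pre-existing PersistentVolumeClaim named big-volume that has already been populated with ~2M small files, and a volume type that supports fsGroup-based ownership management (the claim name and pod name here are made up for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-startup-repro
spec:
  securityContext:
    fsGroup: 1001   # triggers the kubelet's volume ownership management
    # fsGroupChangePolicy defaults to Always, i.e., a full recursive chown/chmod
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: big-volume

The pod then sits in Pending/ContainerCreating for as long as the recursive chown takes, with nothing in the describe pod output indicating why.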

Anything else we need to know?

The same thing goes for pod termination, too. For example, see this bug, which is full of people who cannot determine why a pod won't terminate. Sadly, that bug was never addressed, and was dismissed with:

there are a lot of different issues being discussed in this thread. There are many possible reasons why a Pod could get stuck in terminating. We know this is a common issue with many possible root causes!

The same pod (large volume + fsGroup) also fails to terminate cleanly; that is how I stumbled into that bug. And similarly, there is no output in describe pod as to why the pod isn't making progress. Users of Kubernetes need to be able to see which operations against a pod the kubelet is still working on, i.e., the operations preventing it from transitioning to, e.g., Terminated or ContainerCreating.

Cf. Azure/AKS#3865

Cf. Azure/AKS#3681

We didn't have someone on hand who knew this particular facet of k8s by heart, and thus spent years dealing with this. Azure's support spent years on it too, similarly making little to no progress. (Until issue 3865 above finally cracked the case.) That is what makes a useful diagnostic all the more important: to clue the user in on a default behavior that is probably the right thing for most use cases, but is not something the user ever asked for, and thus not something they are going to guess.

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

Azure (AKS), GCP (GKE)

OS version

GKE:

(We've since mostly left Azure/AKS, in part because of this. But given that we have replicated the behavior on GKE, I'm fairly sure the linked bug is correct that this is base k8s behavior, not specific to a vendor's implementation.)

# cat /etc/lsb-release
CHROMEOS_AUSERVER=https://tools.google.com/service/update2
CHROMEOS_BOARD_APPID={76E245CF-C0D0-444D-BA50-36739C18EB00}
CHROMEOS_CANARY_APPID={90F229CE-83E2-4FAF-8479-E368A34938B1}
CHROMEOS_DEVSERVER=
CHROMEOS_RELEASE_APPID={76E245CF-C0D0-444D-BA50-36739C18EB00}
CHROMEOS_RELEASE_BOARD=lakitu
CHROMEOS_RELEASE_BRANCH_NUMBER=66
CHROMEOS_RELEASE_BUILD_NUMBER=17800
CHROMEOS_RELEASE_BUILD_TYPE=Official Build
CHROMEOS_RELEASE_CHROME_MILESTONE=109
CHROMEOS_RELEASE_DESCRIPTION=17800.66.78 (Official Build) stable-channel lakitu
CHROMEOS_RELEASE_KEYSET=v10
CHROMEOS_RELEASE_NAME=Chrome OS
CHROMEOS_RELEASE_PATCH_NUMBER=78
CHROMEOS_RELEASE_TRACK=stable-channel
CHROMEOS_RELEASE_VERSION=17800.66.78
DEVICETYPE=OTHER
GOOGLE_RELEASE=17800.66.78
HWID_OVERRIDE=LAKITU DEFAULT

Install tools

Container runtime (CRI) and version (if applicable)

containerd://1.7.10

Related plugins (CNI, CSI, ...) and versions (if applicable)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/sig node