Make the `kubernetes.kubelet.cpuManagerPolicy` field immutable
ialidzhikov opened this issue · comments
How to categorize this issue?
/area quality
/kind bug
What happened:
Together with @adenitiu and @nickytd, we discovered that changing the `kubernetes.kubelet.cpuManagerPolicy` field breaks the kubelet.
Afterwards, the kubelet fails to start and logs the following errors:
E0227 12:31:16.556082 73381 kubelet.go:1466] "Failed to start ContainerManager" err="start cpu manager error: could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
E0227 12:31:16.556064 73381 cpu_manager.go:224] "Could not initialize checkpoint manager, please drain node and remove policy state file" err="could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
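The check behind this error can be illustrated with a minimal sketch. It is not the kubelet's actual code: the `checkpoint` type and the `checkPolicy` helper are hypothetical, and the JSON key `policyName` is an assumption about the on-disk format of `/var/lib/kubelet/cpu_manager_state`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint mirrors only the field needed for this illustration.
// The JSON key "policyName" is an assumption about the checkpoint file format.
type checkpoint struct {
	PolicyName string `json:"policyName"`
}

// checkPolicy is a hypothetical helper: it returns an error when the policy
// recorded in the checkpoint differs from the currently configured policy.
func checkPolicy(checkpointJSON []byte, configured string) error {
	var cp checkpoint
	if err := json.Unmarshal(checkpointJSON, &cp); err != nil {
		return fmt.Errorf("parsing checkpoint: %w", err)
	}
	if cp.PolicyName != configured {
		return fmt.Errorf("configured policy %q differs from state checkpoint policy %q",
			configured, cp.PolicyName)
	}
	return nil
}

func main() {
	// Checkpoint written while the node ran with the "static" policy.
	data := []byte(`{"policyName":"static"}`)
	// The kubelet is now configured with "none", so the check fails.
	fmt.Println(checkPolicy(data, "none"))
}
```

Because the checkpoint file survives kubelet restarts, the mismatch persists until the node is drained and the file is removed, as the error message itself suggests.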
Node events that prove that the kubelet gets constantly restarted:
% k describe no shoot--foo--bar-worker-z2-5544c-sd4nh
Normal Starting 3m2s kubelet Starting kubelet.
Normal Starting 2m57s kubelet Starting kubelet.
Normal Starting 2m51s kubelet Starting kubelet.
Normal Starting 2m46s kubelet Starting kubelet.
Normal Starting 2m40s kubelet Starting kubelet.
Warning InvalidDiskCapacity 2m40s kubelet invalid capacity 0 on image filesystem
Warning kubelet 2m39s (x78 over 122m) healthcheck Kubelet is unhealthy for more than 1m0s, restarting it. Health check error: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
Normal Starting 2m39s kubelet Starting kubelet.
Normal Starting 2m33s kubelet Starting kubelet.
Normal Starting 2m27s kubelet Starting kubelet.
Warning InvalidDiskCapacity 2m27s kubelet invalid capacity 0 on image filesystem
Normal Starting 2m22s kubelet Starting kubelet.
Normal Starting 2m16s kubelet Starting kubelet.
Normal Starting 2m11s kubelet Starting kubelet.
Normal Starting 2m5s kubelet Starting kubelet.
Normal Starting 2m kubelet Starting kubelet.
Normal Starting 114s kubelet Starting kubelet.
Normal Starting 109s kubelet Starting kubelet.
Warning InvalidDiskCapacity 109s kubelet invalid capacity 0 on image filesystem
Normal Starting 103s kubelet Starting kubelet.
Warning InvalidDiskCapacity 103s kubelet invalid capacity 0 on image filesystem
Normal Starting 98s kubelet Starting kubelet.
Warning InvalidDiskCapacity 98s kubelet invalid capacity 0 on image filesystem
Normal Starting 92s kubelet Starting kubelet.
Warning InvalidDiskCapacity 92s kubelet invalid capacity 0 on image filesystem
Normal Starting 87s kubelet Starting kubelet.
Warning InvalidDiskCapacity 87s kubelet invalid capacity 0 on image filesystem
Normal Starting 81s kubelet Starting kubelet.
Normal Starting 76s kubelet Starting kubelet.
Normal Starting 70s kubelet Starting kubelet.
Warning InvalidDiskCapacity 70s kubelet invalid capacity 0 on image filesystem
Normal Starting 69s kubelet Starting kubelet.
Normal Starting 63s kubelet Starting kubelet.
Normal Starting 57s kubelet Starting kubelet.
Warning InvalidDiskCapacity 57s kubelet invalid capacity 0 on image filesystem
Warning FailedNetworkChecks 55s (x27 over 101m) network-problem-detector-host host network problems for jobID/destination combinations: tcp-n2p/shoot--hc-cc-us1--prod-cc-haas-hana-z1-549ff-7wlvp
Normal Starting 52s kubelet Starting kubelet.
Normal Starting 46s kubelet Starting kubelet.
Normal Starting 41s kubelet Starting kubelet.
Warning InvalidDiskCapacity 41s kubelet invalid capacity 0 on image filesystem
Normal Starting 35s kubelet Starting kubelet.
Normal Starting 30s kubelet Starting kubelet.
Normal Starting 24s kubelet Starting kubelet.
Warning InvalidDiskCapacity 24s kubelet invalid capacity 0 on image filesystem
Normal Starting 19s kubelet Starting kubelet.
Normal Starting 13s kubelet Starting kubelet.
Warning InvalidDiskCapacity 13s kubelet invalid capacity 0 on image filesystem
Normal Starting 8s kubelet Starting kubelet.
Normal Starting 2s kubelet Starting kubelet.
What you expected to happen:
The `kubernetes.kubelet.cpuManagerPolicy` field to be immutable.
How to reproduce it (as minimally and precisely as possible):
- Create a worker pool with `kubernetes.kubelet.cpuManagerPolicy=static`.
- Change the `kubernetes.kubelet.cpuManagerPolicy` field to `none`.
- Make sure that kubelet is failing to start with:
E0227 12:31:16.556082 73381 kubelet.go:1466] "Failed to start ContainerManager" err="start cpu manager error: could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
E0227 12:31:16.556064 73381 cpu_manager.go:224] "Could not initialize checkpoint manager, please drain node and remove policy state file" err="could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
Anything else we need to know?:
Environment:
- Gardener version: v1.88.0
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- Others:
/assign
Hi @ialidzhikov, we actually still need to change this field, so making it immutable would cause other issues for us. A better approach would be to allow the change only when the worker group has 0 nodes, and to reject it otherwise.
What about triggering a rolling update of the nodes when this field is changed?
Hi @ialidzhikov, is there any update on this issue? Thanks!
Sorry, I won't have capacity in the next weeks to look into this issue due to other priorities.
/unassign
In case we would like to pursue the node rolling approach, I assume we have to wait for #9699 first.
cc @MichaelEischer @timebertt @kon-angelo - perhaps you want to consider this `cpuManagerPolicy` field in the new "hash function" right away?
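To make the idea concrete, here is a minimal sketch of a worker pool hash that includes the CPU manager policy, so that changing the policy yields a new hash and therefore new nodes. It is not the hash function from #9699: `poolHash`, its inputs, and the 8-character truncation are assumptions made purely for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// poolHash is a hypothetical worker pool hash. Any change to one of its
// inputs produces a different hash, which would trigger a node rollout.
func poolHash(machineType, kubernetesVersion, cpuManagerPolicy string) string {
	// The kubelet defaults to "none", so treat an unset policy the same way
	// to avoid a needless rollout when the default is made explicit.
	if cpuManagerPolicy == "" {
		cpuManagerPolicy = "none"
	}
	h := sha256.New()
	for _, field := range []string{machineType, kubernetesVersion, cpuManagerPolicy} {
		// The length prefix keeps adjacent fields from running together.
		fmt.Fprintf(h, "%d:%s;", len(field), field)
	}
	return hex.EncodeToString(h.Sum(nil))[:8]
}

func main() {
	fmt.Println(poolHash("m5.large", "1.28.5", "static"))
	fmt.Println(poolHash("m5.large", "1.28.5", "none"))
}
```

New nodes start without an old `cpu_manager_state` checkpoint, so a rollout sidesteps the policy mismatch without making the field immutable.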