Make the `kubernetes.kubelet.cpuManagerPolicy` field immutable

ialidzhikov opened this issue 4 months ago

Ismail Alidzhikov commented 4 months ago

How to categorize this issue?

/area quality
/kind bug

What happened:
Together with @adenitiu and @nickytd, we discovered that changing the `kubernetes.kubelet.cpuManagerPolicy` field of an existing worker pool breaks the kubelet.
After the change, the kubelet fails to start and logs the following errors:

E0227 12:31:16.556082   73381 kubelet.go:1466] "Failed to start ContainerManager" err="start cpu manager error: could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
E0227 12:31:16.556064   73381 cpu_manager.go:224] "Could not initialize checkpoint manager, please drain node and remove policy state file" err="could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
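
The errors above come from the kubelet comparing its configured policy with the policy stored in the checkpoint file. A minimal sketch of that comparison follows. It is not the kubelet's actual code: the struct is hypothetical, and the JSON field names `policyName` and `defaultCpuSet` are assumptions about the checkpoint layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cpuManagerCheckpoint models only the checkpoint fields this sketch needs.
// Assumption: the state file is JSON and uses these field names.
type cpuManagerCheckpoint struct {
	PolicyName    string `json:"policyName"`
	DefaultCPUSet string `json:"defaultCpuSet"`
}

// checkpointPolicyMismatch reports whether the policy recorded in the raw
// checkpoint content differs from the configured policy.
func checkpointPolicyMismatch(raw []byte, configured string) (bool, error) {
	var cp cpuManagerCheckpoint
	if err := json.Unmarshal(raw, &cp); err != nil {
		return false, fmt.Errorf("decoding checkpoint: %w", err)
	}
	return cp.PolicyName != configured, nil
}
```

Feeding it the content of `/var/lib/kubelet/cpu_manager_state` from a node created with `static` and the configured value `none` would report a mismatch, which is the situation in the logs.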

Node events showing that the kubelet is restarted constantly:

 % k describe no shoot--foo--bar-worker-z2-5544c-sd4nh

  Normal   Starting                 3m2s                   kubelet                        Starting kubelet.
  Normal   Starting                 2m57s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m51s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m46s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m40s                  kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      2m40s                  kubelet                        invalid capacity 0 on image filesystem
  Warning  kubelet                  2m39s (x78 over 122m)  healthcheck                    Kubelet is unhealthy for more than 1m0s, restarting it. Health check error: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
  Normal   Starting                 2m39s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m33s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m27s                  kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      2m27s                  kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 2m22s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m16s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m11s                  kubelet                        Starting kubelet.
  Normal   Starting                 2m5s                   kubelet                        Starting kubelet.
  Normal   Starting                 2m                     kubelet                        Starting kubelet.
  Normal   Starting                 114s                   kubelet                        Starting kubelet.
  Normal   Starting                 109s                   kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      109s                   kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 103s                   kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      103s                   kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 98s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      98s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 92s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      92s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 87s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      87s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 81s                    kubelet                        Starting kubelet.
  Normal   Starting                 76s                    kubelet                        Starting kubelet.
  Normal   Starting                 70s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      70s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 69s                    kubelet                        Starting kubelet.
  Normal   Starting                 63s                    kubelet                        Starting kubelet.
  Normal   Starting                 57s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      57s                    kubelet                        invalid capacity 0 on image filesystem
  Warning  FailedNetworkChecks      55s (x27 over 101m)    network-problem-detector-host  host network problems for jobID/destination combinations: tcp-n2p/shoot--hc-cc-us1--prod-cc-haas-hana-z1-549ff-7wlvp
  Normal   Starting                 52s                    kubelet                        Starting kubelet.
  Normal   Starting                 46s                    kubelet                        Starting kubelet.
  Normal   Starting                 41s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      41s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 35s                    kubelet                        Starting kubelet.
  Normal   Starting                 30s                    kubelet                        Starting kubelet.
  Normal   Starting                 24s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      24s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 19s                    kubelet                        Starting kubelet.
  Normal   Starting                 13s                    kubelet                        Starting kubelet.
  Warning  InvalidDiskCapacity      13s                    kubelet                        invalid capacity 0 on image filesystem
  Normal   Starting                 8s                     kubelet                        Starting kubelet.
  Normal   Starting                 2s                     kubelet                        Starting kubelet.

What you expected to happen:
The kubernetes.kubelet.cpuManagerPolicy field to be immutable.

How to reproduce it (as minimally and precisely as possible):

1. Create a worker pool with kubernetes.kubelet.cpuManagerPolicy=static.
2. Change the kubernetes.kubelet.cpuManagerPolicy field to none.
3. Verify that the kubelet fails to start with:

E0227 12:31:16.556082   73381 kubelet.go:1466] "Failed to start ContainerManager" err="start cpu manager error: could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"
E0227 12:31:16.556064   73381 cpu_manager.go:224] "Could not initialize checkpoint manager, please drain node and remove policy state file" err="could not restore state from checkpoint: configured policy \"none\" differs from state checkpoint policy \"static\", please drain this node and delete the CPU manager checkpoint file \"/var/lib/kubelet/cpu_manager_state\" before restarting Kubelet"

Anything else we need to know?:

Environment:

Gardener version: v1.88.0
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

Ismail Alidzhikov commented 4 months ago

/assign

syy6 · Answer 1 · Wed Feb 28 2024 14:48:06 GMT+0800 (China Standard Time)

Hi @ialidzhikov, we actually still need to be able to change this field; making it immutable would cause other issues for us. A better approach would be to allow the change only if the worker group has 0 nodes, and to reject it otherwise.
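
The zero-node rule suggested here could look roughly like the sketch below. The function name and its parameters are hypothetical and not part of Gardener's validation code; where the node count would come from is left open.

```go
package main

import "fmt"

// validateCPUManagerPolicyUpdate is a hypothetical check: it allows a policy
// change only when the worker pool currently has no nodes.
func validateCPUManagerPolicyUpdate(oldPolicy, newPolicy string, nodeCount int) error {
	// An unchanged policy is always fine; an empty pool has no checkpoint
	// files that could conflict with the new policy.
	if oldPolicy == newPolicy || nodeCount == 0 {
		return nil
	}
	return fmt.Errorf(
		"cpuManagerPolicy cannot change from %q to %q while the worker pool has %d node(s)",
		oldPolicy, newPolicy, nodeCount,
	)
}
```

With this rule, `validateCPUManagerPolicyUpdate("static", "none", 3)` returns an error, while the same change on a pool scaled down to 0 nodes passes.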

Rafael Franzke · Answer 2 · Wed Feb 28 2024 14:59:51 GMT+0800 (China Standard Time)

What about triggering a rolling update of the nodes when this field is changed?

syy6 · Answer 3 · Thu May 02 2024 14:04:43 GMT+0800 (China Standard Time)

Hi @ialidzhikov, is there any update on this issue? Thanks!

Ismail Alidzhikov · Answer 4 · Wed May 08 2024 15:09:05 GMT+0800 (China Standard Time)

Sorry, I won't have capacity in the coming weeks to look into this issue due to other priorities.

/unassign

Rafael Franzke · Answer 5 · Wed May 08 2024 15:17:25 GMT+0800 (China Standard Time)

In case we would like to pursue the node rolling approach, I assume we have to wait for #9699 first.

cc @MichaelEischer @timebertt @kon-angelo - perhaps you want to consider this cpuManagerPolicy field in the new "hash function" right away?
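
To illustrate the idea of including the field in a hash that decides when nodes are rolled, here is a minimal sketch. It is purely hypothetical: the function, the chosen inputs (`machineType`, `imageVersion`) and the truncation to 8 characters are assumptions, not the hash calculation Gardener uses or the one being designed in #9699.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// workerPoolHash is a hypothetical hash over worker pool settings. Because
// cpuManagerPolicy is one of the inputs, changing it yields a new hash,
// which a controller could treat as a signal to replace the nodes.
func workerPoolHash(machineType, imageVersion, cpuManagerPolicy string) string {
	fields := []string{machineType, imageVersion, cpuManagerPolicy}
	// A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
	sum := sha256.Sum256([]byte(strings.Join(fields, "\x00")))
	return hex.EncodeToString(sum[:])[:8]
}
```

New nodes would then start with the new policy and a fresh checkpoint file, so the mismatch from the logs would not occur on them.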