Rework OperatingSystemConfigKey and WorkerPoolHash to allow considering `kubeReserved`
MichaelEischer opened this issue
How to categorize this issue?
/area robustness
/kind enhancement
Suggested approach for implementing the "Rolling update of the worker pool when critical kubelet configuration changed" step from #2590.
Summary
To roll worker node pools if resource reservations managed via `kubeReserved` change, it becomes necessary to version the calculation of the OperatingSystemConfig key as well as the `WorkerPoolHash`. This ensures that worker pools are only rolled when `kubeReserved` actually changes, and not unnecessarily once `kubeReserved` starts to be considered for node rolls.
Motivation
Changes of `kubeReserved` for existing clusters currently happen in-place. They are applied by restarting the kubelet on each node with the new resource reservations. This can cause immediate preemptions on already loaded nodes. In particular, PodDisruptionBudgets are not considered, which can lead to workload disruptions. To upgrade existing workloads to new node resource reservations with minimal disruptions, we want to roll the worker nodes and use the updated reservations only on new nodes. This requires rolling the worker pool and switching to a new OperatingSystemConfig (OSC), which includes the `kubeReserved` value. The new OSC must use a different name to prevent already existing nodes from applying the new `kubeReserved` values.
#2590 introduces a new way to calculate default `kubeReserved` values. Upgrading to these new resource reservations with minimal disruptions requires the previously mentioned mechanism. However, the first attempt in #9465 was unable to handle the initial rollout without disruptions.
Problem
Worker pool rolls are triggered if the `WorkerPoolHash` changes. To consider new fields in the `WorkerPoolHash`, the current approach is to add a new optional field to the `extensionsv1alpha1.Worker` objects. The field is then only included in the `WorkerPoolHash` if it is set. Thereby, a node pool roll is only triggered by using the new feature/field.
A worker pool must only be rolled if required by changed settings of the worker pool, that is, it MUST NOT roll unnecessarily when upgrading the `WorkerPoolHash` calculation.
The optional field approach does not work for `kubeReserved` as it already has a value that may differ from the static defaults used by Gardener. Thus, `kubeReserved` always has a value, and including it in the `WorkerPoolHash` would trigger an immediate node roll.
In addition, the OperatingSystemConfig key (OSCKey) must also change to ensure that only new workers pick up the new configuration. Currently, this requires manually keeping the OSCKey and the `WorkerPoolHash` in sync, such that each change of the OSCKey also coincides with a node pool roll. Instead, the `WorkerPoolHash` should include an OSC-specific hash as input to trigger a node roll when the OSC key changes.
As the OSC key must also change if `kubeReserved` changes, this shifts the problem of keeping the `WorkerPoolHash` stable to keeping the OSC key stable.
Goals
- Extract provider-independent attributes from the `WorkerPoolHash` calculation to gardenlet.
- Trigger a node roll if `kubeReserved` changes.
- But do not roll all nodes when initially rolling out the new hash calculation.
Non-Goals
- Introduce a pattern that extensions can follow to handle new node roll triggers in their `providerConfig`, similar to `kubeReserved`.
Proposal
The central idea is to version the `WorkerPoolHash` and the OSCKey calculation. Already existing worker pools and OSCs must stick to the old hash version. If `kubeReserved` changes, then the worker pool should be upgraded to the new hash version. The necessary state to track the used hash version is stored in a single secret for each shoot.
As the Worker configuration and therefore the `WorkerPoolHash` are tied to a specific OSC, we'll start by discussing the OSCKey calculation and versioning.
OSCKey Hash Calculation
We propose to provide two OSCKey hash versions:
- Version 1 (current behavior): calculate the hash based on `worker.Name`, `minorKubernetesVersion`, `worker.CRI` and `worker.Machine.Image.Name`. The resulting value must be identical to the current result.
- Version 2: use the format `gardener-node-agent-<worker.Name>-hash(worker.CRI, machineType, volume type+size, worker.Machine.Image.Name+Version, minorKubernetesVersion, credentialsRotationStatus, nodeLocalDNS, kubeReserved)[:16]-<suffix>`
  - This includes all provider-independent node roll triggers that were previously included in the `WorkerPoolHash`.
  - Note: the image name is no longer included in the OSC name.
  - Maximum length: 61 characters (`worker.Name` is limited to 15 characters, the suffix is at most 8).
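As a minimal sketch, the version 2 key could be computed as below. The helper name `oscKeyV2`, the variadic input list, and the zero-byte separator are illustrative assumptions; the actual field serialization is up to the implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// oscKeyV2 sketches the proposed version 2 format:
// gardener-node-agent-<worker.Name>-<hash[:16]>-<suffix>.
// The proposal hashes worker.CRI, machine type, volume type+size,
// image name+version, minor Kubernetes version, credentials rotation
// status, nodeLocalDNS and kubeReserved; here they are passed as
// pre-serialized strings for illustration.
func oscKeyV2(workerName, suffix string, hashInputs ...string) string {
	h := sha256.New()
	for _, in := range hashInputs {
		h.Write([]byte(in))
		h.Write([]byte{0}) // separator so concatenated inputs cannot collide
	}
	digest := hex.EncodeToString(h.Sum(nil))
	return fmt.Sprintf("gardener-node-agent-%s-%s-%s", workerName, digest[:16], suffix)
}

func main() {
	key := oscKeyV2("worker-a", "abcd1234",
		"containerd",         // worker.CRI
		"m5.large",           // machine type
		"gp3/50Gi",           // volume type+size
		"coreos/3815.2.1",    // image name+version
		"1.28",               // minor Kubernetes version
		"rotation-completed", // credentials rotation status
		"true",               // nodeLocalDNS enabled
		"cpu=80m,memory=8Gi", // kubeReserved
	)
	fmt.Println(key, len(key))
}
```

With a 15-character worker name and an 8-character suffix, the fixed prefix (20), two separators (2), and the 16-character hash add up to exactly 61 characters, matching the stated maximum.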
OSCKey Versioning
gardenlet stores a secret called `pool-hashes` in the shoot namespace of the hosting seed. The secret contains the field `data`, which for each pool records the OSCKey hash version in use and the hash values calculated with the current and the latest OSCKey hash version supported by Gardener.
```yaml
kind: Secret
metadata:
  name: pool-hashes
  namespace: shoot--project--shootname
  labels:
    persist: "true" # -> store and migrate during control plane migration
stringData:
  data: |
    pools:
    - name: a
      currentVersion: 1
      hashes:
        "1": fede
        "2": abcd
    - name: b
      currentVersion: 2
      hashes:
        "2": dada
```
The secret is read by gardenlet while reconciling the OSCs for a shoot and is updated before the updated OSCs are written. The secret includes an entry for each worker pool in the shoot; worker pools are matched by their name. An individual entry is updated as follows:
- If no entry exists for a worker pool, or the secret as a whole does not exist, then create a new entry that uses the latest hash version as `currentVersion`.
- Calculate the current hash value for each hash version included in the `hashes` field. If any of those hash values changes, then set `currentVersion` to the latest supported version.
- Update the `hashes` field to include the hash values calculated using the `currentVersion` and the latest version supported by Gardener. Remove hashes for other versions.
Currently, secrets with the `persist` label must also be labeled with `managed-by: secrets-manager` to be migrated during the control plane migration. To migrate the `pool-hashes` secret, the current `managed-by: secrets-manager` filter must be removed from `computeSecretsToPersist`.
For the initial rollout of this secret, gardenlet creates on startup a `pool-hashes` secret for each shoot based on the worker pools currently present in the shoot spec. For each worker pool, only the `name` field is included and `currentVersion` is set to `1`. The `hashes` field is not set; the next OSC reconciliation will add the missing hash values.
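For illustration, a freshly bootstrapped secret for a shoot with worker pools `a` and `b` (names assumed) would look like this; the hash values are filled in on the next OSC reconciliation:

```yaml
kind: Secret
metadata:
  name: pool-hashes
  namespace: shoot--project--shootname
  labels:
    persist: "true"
stringData:
  data: |
    pools:
    - name: a
      currentVersion: 1
    - name: b
      currentVersion: 1
```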
The rationale for the fields is as follows:
- `kubeReserved` is a property of each worker pool and thus must be stored at this granularity.
- The `currentVersion` of the hash must be stored to prevent unnecessary changes of the OSCKey.
- The previous `hashes` must be stored to allow fields that are only included in a new hash version to trigger a node roll. For example, `kubeReserved` is only included in hash version 2. However, changing the value should nevertheless trigger a hash version upgrade along with a node roll. A change of `kubeReserved` can only be detected by storing the hash (or its underlying information) calculated using version 2.
- When introducing a new hash version, the missing `hashes` are only added during OSC reconciliation. Consequently, changes to fields that are only included in the new hash will only trigger a node roll after the first successful OSC reconciliation.
- The secret is marked with `persist: "true"` to ensure that it is migrated during a control plane migration.
WorkerPoolHash
The `WorkerPool` of an `extensionsv1alpha1.Worker` is extended with an `oscHash` field. This field is set to the current hash value of the corresponding OSC, unless the OSC still uses hash version 1.
The `WorkerPoolHash` calculation works differently depending on whether `oscHash` is set:
- `oscHash` is empty: continue using the current `WorkerPoolHash` calculation. This includes all provider-independent node roll triggers (as before).
- `oscHash` is set: the `WorkerPoolHash` calculation only uses the `oscHash` and provider-extension-specific additional fields as input. The latter have to be passed in explicitly by the extension; the raw value of `workerPool.ProviderConfig.Raw` is no longer added to the hash. Previously used fields like the Kubernetes minor version are already covered by the `oscHash`.
The OSC for previously existing worker pools uses hash version 1. Thereby, the `WorkerPoolHash` remains unchanged when this change is initially rolled out.
```yaml
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Worker
metadata:
  name: example
spec:
  pools:
  - name: "a"
    oscHash: "fede" # gardener hash part (same value as OSC name hash), empty if v1 is used for OSC
    kubernetesVersion: 1.28.9
    machineImage:
      name: coreos
      version: 3815.2.1
    kubeReserved:
      cpu: 80m
      memory: 8Gi
```
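A minimal sketch of the proposed branching, with illustrative helper and parameter names (`workerPoolHash`, `legacyInputs`, `providerFields` are assumptions, not the actual extension library API):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// workerPoolHash sketches the proposed behavior: with an empty oscHash the
// legacy (version 1) inputs are hashed as before; with a set oscHash only
// the oscHash plus explicitly passed provider-specific fields enter the hash.
func workerPoolHash(oscHash string, legacyInputs, providerFields []string) string {
	h := sha256.New()
	if oscHash == "" {
		// Version 1: keep the current calculation so that existing pools
		// do not roll when this change is rolled out.
		for _, in := range legacyInputs {
			h.Write([]byte(in))
			h.Write([]byte{0})
		}
	} else {
		// Version 2: the oscHash already covers all provider-independent
		// triggers (Kubernetes minor version, kubeReserved, ...); only
		// provider-specific fields are added explicitly.
		h.Write([]byte(oscHash))
		for _, f := range providerFields {
			h.Write([]byte{0})
			h.Write([]byte(f))
		}
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	fmt.Println(workerPoolHash("", []string{"1.28", "coreos/3815.2.1"}, nil))
	fmt.Println(workerPoolHash("fede", nil, []string{"cloud-specific-flag"}))
}
```

The key property is that a changed `oscHash` always changes the pool hash, so any OSC key change triggers a node roll without the two calculations having to be kept in sync manually.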
Removal of Legacy Hashes
Legacy hash versions can only be removed once we can guarantee that there are no more users. The only way to ensure that is to wait until all currently supported Kubernetes versions are no longer supported by Gardener. Then it is guaranteed that a node roll has happened since the new hash version was introduced, and thereby that the hash version of all OSCs has been upgraded.
OSCKey Label for Shoots
The shoot health checks in botanist currently have to calculate the OSCKey based on information annotated on each node. This will no longer work with the aforementioned changes. As a replacement, each node is labelled with `worker.gardener.cloud/operatingsystemconfig`, which contains the name of the corresponding OSC. Thereby, the health checks no longer require knowledge of how to calculate the OSC name/key.
The label is included in the `Worker` extension object and therefore will be added to all nodes on the next reconciliation. For a smooth migration, the health check initially has to fall back to the current approach of calculating the OSCKey itself. This fallback can be removed after a transition period of a few Gardener versions.
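A sketch of the label lookup with the transitional fallback; the function name `oscNameForNode` and the `computeLegacyKey` callback (standing in for the current calculation) are hypothetical:

```go
package main

import "fmt"

const oscLabel = "worker.gardener.cloud/operatingsystemconfig"

// oscNameForNode sketches the proposed health-check lookup: prefer the
// node label and fall back to recomputing the key for nodes that have
// not been reconciled with the new Worker object yet.
func oscNameForNode(nodeLabels map[string]string, computeLegacyKey func() string) string {
	if name, ok := nodeLabels[oscLabel]; ok {
		return name
	}
	return computeLegacyKey() // transitional fallback, removed after a few versions
}

func main() {
	labeled := map[string]string{oscLabel: "gardener-node-agent-a-fede-abcd1234"}
	fmt.Println(oscNameForNode(labeled, func() string { return "legacy-key" }))
	fmt.Println(oscNameForNode(nil, func() string { return "legacy-key" }))
}
```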
Alternatives
- Add a flag for each worker pool that tracks whether `kubeReserved` still uses the default value (ignored by the `WorkerPoolHash` calculation). This is rather ugly as it requires keeping an additional field for each worker pool.
- Use the Gardener Node Agent to upgrade `kubeReserved` in-place. Changing `kubeReserved` requires a restart of the `kubelet` and results in immediate preemptions of pods if not enough resources are available. Existing mechanisms like `maxSurge` or PDBs would be ignored.
- Only include `kubeReserved` in the `WorkerPoolHash` starting from Kubernetes >= 1.30. It would take more than a year to roll out this change to all clusters.
Implementation Steps
Draft:
- Label nodes with the OSCKey
- Implement the OSCKey versioning with the `pool-hashes` secret, but only implement version 1 of the hash
- Implement hash version 2 along with the new `WorkerPoolHash`
- Bump the Gardener version in all provider extensions
cc @rfranzke @kon-angelo
This should reflect the results of our discussions :)