Gardener fails to create shoots with > ~80 worker pools
hown3d opened this issue
How to categorize this issue?
/area scalability
/kind bug
What happened:
When attempting to create a shoot with more than approximately 80 nodepools, the shoot gets stuck in the Create Processing state. The shoot reports the following error: Flow "Shoot cluster reconciliation" encountered task errors: [task "Configuring shoot worker pools" failed: retry failed with context deadline exceeded, last error: etcdserver: request is too large] Operation will be retried.
Upon investigation, we found that the Worker resource created for the shoot becomes excessively large because each WorkerPool contains the userData necessary for machine bootstrap. As a result, the Worker resource exceeds etcd's max-request-bytes limit of 1.5MiB.
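To make the size problem concrete, here is a rough back-of-the-envelope sketch. It only uses the 1.5MiB limit mentioned above; the function names and the idea of a fixed per-pool userData size are assumptions for illustration, not measurements from a real Worker.

```python
# Rough budget arithmetic for a Worker object that inlines userData per pool.
# Assumption: every pool carries roughly the same number of serialized bytes.

ETCD_MAX_REQUEST_BYTES = int(1.5 * 1024 * 1024)  # the 1.5MiB limit cited above


def per_pool_budget(pool_count: int, limit: int = ETCD_MAX_REQUEST_BYTES) -> int:
    """Bytes each pool may occupy if the whole object must stay under the limit."""
    if pool_count <= 0:
        raise ValueError("pool_count must be positive")
    return limit // pool_count


def max_inline_pools(bytes_per_pool: int, limit: int = ETCD_MAX_REQUEST_BYTES) -> int:
    """Number of pools that fit if each one needs bytes_per_pool serialized bytes."""
    if bytes_per_pool <= 0:
        raise ValueError("bytes_per_pool must be positive")
    return limit // bytes_per_pool


if __name__ == "__main__":
    # With 80 pools, each pool has a bit over 19 KiB for everything, userData included.
    print(per_pool_budget(80))
    # Hypothetical 20 KiB per pool: how many pools fit before the request is rejected?
    print(max_inline_pools(20 * 1024))
```

This ignores the rest of the Worker spec and any request overhead, so the real threshold is somewhat lower than what the sketch computes.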
What you expected to happen:
The shoot should be successfully created.
How to reproduce it (as minimally and precisely as possible):
Create shoots with a large number (80-90+) of nodepools.
Proposed Solution:
We suggest replacing the userData field in the WorkerPool type with a secretReference. This approach aligns with the OperatingSystemConfig resource, which already stores its cloud_config in a secret. Refer to the OperatingSystemConfig documentation for more details.
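A minimal sketch of why a reference helps, comparing the serialized size of a Worker-like object with inline userData against one that only points to a secret. The field name `userDataSecretRef`, the secret names, the key `cloud_config`, the pool fields, and the 16 KiB payload are all placeholders chosen for illustration; they are not taken from the actual API.

```python
import base64
import json

ETCD_MAX_REQUEST_BYTES = int(1.5 * 1024 * 1024)  # the 1.5MiB limit cited above


def worker_size(pool_count: int, user_data: bytes, by_reference: bool) -> int:
    """Return the JSON-serialized size in bytes of a simplified Worker-like object."""
    pools = []
    for i in range(pool_count):
        pool = {"name": f"pool-{i}", "minimum": 1, "maximum": 3}
        if by_reference:
            # Hypothetical reference field: only a secret name and key are stored.
            pool["userDataSecretRef"] = {
                "name": f"user-data-pool-{i}",
                "key": "cloud_config",
            }
        else:
            # Inline variant: the whole payload is embedded, base64-encoded.
            pool["userData"] = base64.b64encode(user_data).decode("ascii")
        pools.append(pool)
    worker = {
        "apiVersion": "extensions.gardener.cloud/v1alpha1",
        "kind": "Worker",
        "spec": {"pools": pools},
    }
    return len(json.dumps(worker).encode("utf-8"))


# Made-up payload size; real userData will differ per OS and configuration.
payload = b"x" * (16 * 1024)
inline_size = worker_size(85, payload, by_reference=False)
referenced_size = worker_size(85, payload, by_reference=True)

if __name__ == "__main__":
    print("inline:", inline_size, "bytes")
    print("referenced:", referenced_size, "bytes")
    print("limit:", ETCD_MAX_REQUEST_BYTES, "bytes")
```

With the reference variant, the object grows only by a few dozen bytes per pool, so its size no longer depends on how large the bootstrap payload is.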
Tasks:
- #9722
- Adapt Worker extensions (after v1.95 has been released)
  - provider-alicloud: gardener/gardener-extension-provider-alicloud#727
  - provider-aws: gardener/gardener-extension-provider-aws#961
  - provider-azure: gardener/gardener-extension-provider-azure#868
  - provider-gcp: gardener/gardener-extension-provider-gcp#767
  - provider-openstack: gardener/gardener-extension-provider-openstack#776
  - provider-equinix-metal: gardener/gardener-extension-provider-equinix-metal#314
- After gardener/gardener@v1.100 has been released: Drop deprecated UserData field from extensions.gardener.cloud/v1alpha1.Worker resource
Environment:
- Gardener version: v1.82.3
- Kubernetes version (use kubectl version): v1.26.14
- Cloud provider or hardware configuration:
  - openstack-provider v1.39.2
  - coreos extension v1.21
- Others:
  - etcd-druid v0.20.0
We checked a Worker with > 80 pools and saw that we could reduce its size by ~90% when we referenced the userData instead of putting it inline.
/assign
For now, all related PRs have been opened. Still, we have to wait until gardener/gardener@v1.100 has been released (end of July 2024) before we can finally clean up the inlined, now deprecated .spec.userData field from the extensions.gardener.cloud/v1alpha1.Worker API. This is to give extensions enough time to adapt to the API change.
Before this is done, the issue cannot be closed.