gardener / gardener

Homogeneous Kubernetes clusters at scale on any infrastructure using hosted control planes.

Home Page: https://gardener.cloud

Gardener Resource Reservation Proposal

MichaelEischer opened this issue

This issue contains a detailed proposal on how to configure better default resource reservations as proposed in #2590 .
The goal of better resource reservations is to ensure that workloads on a node cannot cause the node to freeze by overloading it.
By setting proper defaults, nodes are reliable without requiring further configuration.

How to categorize this issue?

/area robustness
/kind enhancement

What would you like to be added:

The following proposal describes how the task "Configure sensible default values for kube-reserved and system-reserved based on the worker pool's machine size, similar to GKE's formula for resource reservation" from #2590 could be implemented. We suggest using GKE's formula to calculate the default resource reservations. This results in resource reservations that depend on the machine size, in particular the number of CPU cores and the available memory.
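
For illustration, here is a minimal sketch of that formula in Go. The tier percentages follow GKE's publicly documented reservation formula; the function and package names are made up for this sketch and are not part of Gardener.

package main

import "fmt"

// tier describes one band of the machine's capacity and the fraction of that
// band that gets reserved.
type tier struct {
	upTo    int64   // upper bound of the band (millicores or MiB)
	percent float64 // fraction reserved within this band
}

// ReservedCPUMilli returns the CPU reservation in millicores: 6% of the first
// core, 1% of the next core, 0.5% of the next 2 cores, 0.25% of all cores
// above 4.
func ReservedCPUMilli(cpuMilli int64) int64 {
	return applyTiers(cpuMilli, []tier{
		{upTo: 1000, percent: 0.06},
		{upTo: 2000, percent: 0.01},
		{upTo: 4000, percent: 0.005},
		{upTo: 1 << 62, percent: 0.0025},
	})
}

// ReservedMemoryMiB returns the memory reservation in MiB: 255 MiB for
// machines below 1 GiB, otherwise 25% of the first 4 GiB, 20% of the next
// 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB and 2% of everything
// above 128 GiB.
func ReservedMemoryMiB(memoryMiB int64) int64 {
	if memoryMiB < 1024 {
		return 255
	}
	return applyTiers(memoryMiB, []tier{
		{upTo: 4 * 1024, percent: 0.25},
		{upTo: 8 * 1024, percent: 0.20},
		{upTo: 16 * 1024, percent: 0.10},
		{upTo: 128 * 1024, percent: 0.06},
		{upTo: 1 << 62, percent: 0.02},
	})
}

// applyTiers sums up the reserved share of each band up to the given total.
func applyTiers(total int64, tiers []tier) int64 {
	var reserved float64
	prev := int64(0)
	for _, t := range tiers {
		if total <= prev {
			break
		}
		upper := total
		if t.upTo < upper {
			upper = t.upTo
		}
		reserved += float64(upper-prev) * t.percent
		prev = t.upTo
	}
	return int64(reserved)
}

func main() {
	// Example: an 8 core / 16 GiB machine yields 90m CPU and 2662 MiB (~2.6 GiB).
	fmt.Printf("cpu=%dm memory=%dMi\n", ReservedCPUMilli(8*1000), ReservedMemoryMiB(16*1024))
}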

  • Keep the new resource reservation calculation behind a feature gate
  • Remove static defaults for v1beta1.Shoot fields Kubernetes.Kubelet.{KubeReserved,SystemReserved}. Existing shoots will keep the old default resource reservations, which allows switching to the new defaults in a controlled manner. The proper resource reservation values are MachineType-specific and therefore a value set at Shoot level is not specific enough.
  • Extend the CloudProfile to optionally allow specifying resource reservations for each machine type. Add fields KubeReserved and SystemReserved (equivalent to those used in the Shoot spec) to the v1beta1.MachineType struct.
    GKE's formula for resource reservations is probably not ideal for every use case, so provide an option for Gardener operators to override the value per machine type.
  • Update the operating system config calculation in operatingSystemConfig.newDeployer such that the resource reservations are determined as follows (see the sketch after this list):
    • use the first applicable (most specific) resource reservation setting:
      1. worker-specific resource reservations,
      2. resource reservations from Shoot spec,
      3. resource reservations explicitly specified for MachineType in CloudProfile,
      4. GKE formula based on CPU/memory values in CloudProfile
    • Resulting behavior: the GKE formula only applies to newly created shoots by default. Existing shoots only pick up the new defaults once the old default is removed from their Shoot spec.
    • Potential Problem: changes to the CloudProfile might propagate to the shoots in an uncontrolled manner.
      • Currently, all worker nodes would quickly apply the new resource reservations without further coordination.
      • After implementing the "Rolling update of the worker pool when critical kubelet configuration changed" task from #2590 this would trigger an extensive node roll.
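
To make the CloudProfile extension and the precedence above a bit more concrete, here is a rough sketch continuing the Go example from above. All type and helper names are illustrative, not the actual Gardener types.

// KubeletReservations stands in for the kubeReserved/systemReserved settings
// of the Shoot kubelet configuration.
type KubeletReservations struct {
	CPU    string
	Memory string
	PID    string
}

// MachineType sketches the proposed extension of v1beta1.MachineType with
// optional, operator-provided reservations.
type MachineType struct {
	Name           string
	CPUMilli       int64
	MemoryMiB      int64
	KubeReserved   *KubeletReservations // proposed new field
	SystemReserved *KubeletReservations // proposed new field
}

// resolveReservations returns the first applicable (most specific) setting,
// mirroring the precedence list above.
func resolveReservations(worker, shoot *KubeletReservations, machineType MachineType) KubeletReservations {
	switch {
	case worker != nil:
		return *worker // 1. worker-specific resource reservations
	case shoot != nil:
		return *shoot // 2. resource reservations from the Shoot spec
	case machineType.KubeReserved != nil:
		return *machineType.KubeReserved // 3. explicit value for the MachineType in the CloudProfile
	default:
		// 4. GKE formula based on the CPU/memory values in the CloudProfile
		return KubeletReservations{
			CPU:    fmt.Sprintf("%dm", ReservedCPUMilli(machineType.CPUMilli)),
			Memory: fmt.Sprintf("%dMi", ReservedMemoryMiB(machineType.MemoryMiB)),
		}
	}
}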

Alternative implementation:

  • Remove static defaults for v1beta1.Shoot fields Kubernetes.Kubelet.{KubeReserved,SystemReserved}.
  • Use an optional admission plugin that injects Kubernetes.Kubelet.{KubeReserved,SystemReserved} for each Worker in a Shoot object, based on the MachineType-specific values from the CloudProfile or, as a fallback, the GKE formula applied to the CPU/memory values from the CloudProfile. The value is only injected if Kubernetes.Kubelet.{KubeReserved,SystemReserved} is not set at the Shoot spec level.
    • Existing shoots will not be modified by the admission plugin, as Shoot.Spec.Kubernetes.Kubelet.{KubeReserved,SystemReserved} is set for them.
    • This avoids the problem that the resource reservation can unexpectedly change if the CloudProfile is modified.
    • However, now the MachineType and the resource reservations of a Worker have to be modified together. There is no easy way to distinguish between reservations that are injected defaults and those that were explicitly set by a user.
  • Extend the CloudProfile as described above, but keep the current implementation of operatingSystemConfig.newDeployer.

The alternative approach has a similar effect to the first approach, with the difference that the CloudProfile-based reservation values are already determined when the Shoot object is created.
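
For the alternative, a rough sketch of the defaulting such an admission plugin could perform, reusing the illustrative types from the sketches above (again, this is not the actual plugin, just the intended behaviour expressed in code):

// Worker sketches the relevant fields of a Shoot worker pool.
type Worker struct {
	MachineTypeName string
	KubeReserved    *KubeletReservations // nil means "not set in the Shoot spec"
}

// defaultWorkerReservations injects kubeReserved for every worker pool that
// has no explicit value yet. Values that are already set (by a user or a
// previous default) are never touched, so existing shoots stay unchanged and
// later CloudProfile changes do not propagate unexpectedly.
func defaultWorkerReservations(workers []Worker, machineTypes map[string]MachineType) {
	for i := range workers {
		if workers[i].KubeReserved != nil {
			continue
		}
		mt, ok := machineTypes[workers[i].MachineTypeName]
		if !ok {
			continue // unknown machine type is left to validation
		}
		// Neither a worker- nor a shoot-level value is set, so the explicit
		// CloudProfile value or the GKE formula wins.
		r := resolveReservations(nil, nil, mt)
		workers[i].KubeReserved = &r
	}
}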

Why is this needed:

We want to provide sensible default resource reservations and also allow operators to override them on a per-MachineType basis.

First of all, thanks @MichaelEischer for picking this topic up. I suggest keeping only one issue - so maybe you can add the relevant parts from #2590 to this one and then close it.

To the proposals: I think it's better to explicitly write the values into the Shoot specification and use them from there, i.e., I prefer option 2. The reason for this is what you explained above: When computing this on the fly/during reconciliation, changed values in CloudProfiles might have a negative effect.

To proposal 2: We would still enable this admission plugin by default, right?

However, now the MachineType and the resource reservations of a Worker have to be modified together. There is no easy way to distinguish between reservations that are injected defaults and those that were explicitly set by a user.

Can you explain this a bit more? What do you exactly mean by it/what is the disadvantage you'd like to point out?

@danielfoehrKn did an extensive analysis and the GKE formula doles out resources very generously. This leads to very wasteful Kubernetes.Kubelet.{KubeReserved,SystemReserved}. Yet, in a few cases, even those were insufficient. This is why he suggested a dynamic approach (the formula would waste resources most of the time, while still resulting in stuck nodes in a few cases). The other alternative is what we already had: explicit values.

I think @MichaelEischer wanted to raise the point that it cannot be distinguished whether a human or the plugin changed the worker. Even if the plugin is implemented in the described way (Kubernetes.Kubelet.{KubeReserved,SystemReserved} are only set if they aren't set yet), a problem may arise when the machine type, which is mutable, is changed. Then it becomes unclear whether the machine type was changed and Kubernetes.Kubelet.{KubeReserved,SystemReserved} should be adapted accordingly, or whether the user wanted different Kubernetes.Kubelet.{KubeReserved,SystemReserved} for this machine type.

To proposal 2: We would still enable this admission plugin by default, right?

To start with, it would probably be a good idea to leave the plugin disabled by default until there is some experience with it. Depending on those results, it can be decided whether to enable it by default or keep the plugin opt-in.

This is why he suggested a dynamic approach (the formula would waste resources most of the time, while still resulting in stuck nodes in a few cases).

To provide a bit more context for the proposal, we also performed several tests during which the GKE formula turned out to work fairly well in preventing node freezes. During our tests, we used a test setup that managed to reliably freeze nodes within seconds (using the default reservations), thus leaving no time for dynamic adjustments of the resource reservations. Below I've sketched that test setup along with some analysis on what seems to be the underlying problem. The test setup is a worst case scenario, but a workload that suddenly sees a memory usage spike on an already loaded node should be enough to trigger the same problem.

We've conducted several tests regarding frozen nodes, and the most reliable way to trigger a node freeze turned out to be a workload that saturates both CPU and memory. For that we used a workload generator for Prometheus (just the workload generator, without Prometheus). In the following configuration, each instance consumes about 300-400MB. The memory usage can be scaled roughly linearly by adjusting the --series-count parameter. All pods together also saturate the CPU.

The following deployment pins 50 such pods to a single node and freezes the node within seconds. The worker nodes had 8 cores and 16GB RAM and were using the default Gardener resource reservations (70m CPU, 1000Mi memory and 20k PIDs). The shoot was running Kubernetes 1.26.9 on Flatcar 3510.2.5 with kernel 5.15.119-flatcar.

Test case deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: avalanche
  name: avalanche
spec:
  # 50 replicas are deadly for a node with 8 cores, 16GB RAM
  replicas: 50
  selector:
    matchLabels:
      app: avalanche
  template:
    metadata:
      annotations:
        avalanche/scrape: "true"
      labels:
        app: avalanche
    spec:
      containers:
        - image: quay.io/prometheuscommunity/avalanche:main
          # sha256:e40a6425327b2370d6afbef4d4d505533a36af7b228b3e512c0fc1326e803100
          name: avalanche
          command:
            ["/bin/avalanche", "--series-count", "200", "--value-interval", "5"]
          ports:
            - containerPort: 9001
              protocol: TCP
              name: metrics
      restartPolicy: Always
      nodeSelector:
        kubernetes.io/hostname: shoot--hostname

Using trial & error the values from the GKE formula turned out to work reliably for preventing node freezes. For the deployment, the resource reservations were sufficient to allow OOMEvictions to work.

Based on some analysis using perf, the kernel spent more than 60% of the overall CPU in shrink_slab, which tries to charge memory to cgroups and reclaims memory from them (which is rather inefficient when nearly running out of memory). In combination with memory cgroups (one per container), this part of the memory subsystem can run into scalability issues. The result is that the kernel consumes most of the CPU without making progress, which consequently appears as a frozen node. (There appears to be work on improving that bottleneck in https://lwn.net/Articles/944145/, although progress is rather slow.)

The effect of the resource reservations is that they provide enough reserved resources to allow kubelet/containerd to evict pods when the node runs out of memory. This also seems to be the only way to avoid the node freeze problem for now.
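
For context, the kubelet computes node allocatable as capacity minus kube-reserved, system-reserved and the hard eviction threshold, so the reservations directly become headroom that pods cannot claim. Below is a minimal sketch of how the relevant settings fit together, using the upstream KubeletConfiguration type from k8s.io/kubelet/config/v1beta1; the concrete values are examples, not Gardener's defaults.

package sketch

import kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"

// exampleKubeletSettings shows how kube-reserved, the hard eviction threshold
// and the enforcement setting relate to each other.
func exampleKubeletSettings() *kubeletconfigv1beta1.KubeletConfiguration {
	return &kubeletconfigv1beta1.KubeletConfiguration{
		// Subtracted from node capacity; keeps headroom so that kubelet and
		// containerd can still run (and evict pods) under memory pressure.
		KubeReserved: map[string]string{"cpu": "90m", "memory": "2600Mi"},
		// Hard eviction starts once less than this amount of memory is
		// available on the node.
		EvictionHard: map[string]string{"memory.available": "100Mi"},
		// Only the pods cgroup gets a limit (the Kubernetes default); the
		// reserved values are not enforced on the system slices themselves.
		EnforceNodeAllocatable: []string{"pods"},
	}
}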

@MichaelEischer I am not arguing for dynamic adjustments, but for the record, your arguments appear to be quite contrived. Such an actor would sit in its own cgroup and not be bothered; it could react within milliseconds (though it probably wouldn't, to avoid causing too many disruptions); saturating a node within seconds can only be done with completely unrealistic synthetic load tests and nothing else (workloads almost never jump to maximum CPU and memory usage without prior disk/network operations slowing the ramp-up; also, having only such pods on a node is equally unrealistic); and an actor could even deschedule pods, if necessary. So, if you argue from that direction, I am not agreeing. Be that as it may, we did not really plan to implement the dynamic approach, but ruling it out with your arguments above seems wrong to me.

In general, I agree very much that our defaults are quite low. The reason is that most workload in this world is resource-naive (either no requests or too-high requests, also with VPA and default settings) and nodes rarely reach saturation. So, by not taking away resources that would later be missing for pod requests that aren't utilised anyhow, we save some money for our users. If optimised workload is run that really does leverage the resources, the settings can be adjusted.

I am not arguing that this is great, but I would like to stress that this was not done out of ignorance. Quite the opposite. And that GKE's formula is more generous is to be expected, since GCP also offers and charges for these resources.

We discussed this topic today with @danielfoehrKn @vlerenc @dguendisch @rfranzke @MichaelEischer.

Motivation: the problem we observed

  • Seed nodes killed/unresponsive because of high memory usage of control plane components (e.g., kube-apiserver, prometheus)
    • in our setup: HVPA feature gate is disabled, kube-apiserver is not autoscaled vertically and has no limits
  • Shoot nodes killed/unresponsive because of high memory usage of workload
  • synthetic tests (also see #9105 (comment)):
    • 50 pods with prometheus avalanche on one node -> generates a configurable amount of CPU/memory usage
    • generates about 20% more memory load than the node's total capacity
    • results in pods crossing the configured kubelet eviction threshold -> memory contention with system components
      • slowly increasing memory usage doesn't trigger this scenario
      • memory usage has to grow faster than the kubelet can trigger evictions
    • node freeze occurs because kernel is traversing memory to look for freeable pages before OOM-killing processes
      • effort grows with the number of cgroups in kubepods slice
      • kubelet/containerd also try to allocate memory which is needed for triggering evictions
      • -> no process can get the memory needed for resolving the situation
    • tried different values for system-reserved (kube-reserved set to 0, which doesn't matter because it is not enforced)
      • below GKE formula: node freezes within seconds
      • values of GKE formula: short node freezes can occur (few seconds), but still headroom for OOM mechanisms to kick in

What we have already done to address the problem in our setup

  • manually set resource reservations on seed nodes according to GKE formula
    • no more killed/unresponsive nodes
  • overwrite resource reservations for a few affected shoot clusters according to GKE formula
    • less killed/unresponsive nodes

What we propose and why we think it works well

  • investigation: GKE formula seems to be suitable for most clusters in our setup
  • leaves comfortable headroom for system components on most nodes
  • will prevent the cases described above (especially on nodes with few pods but high memory usage in total)
  • will prevent the cases simulated in synthetic tests
  • will not prevent cases with the maximum number of pods and high system component usage
    • this cannot be prevented by resource reservations while still leaving allocatable resources
    • the resource reservation feature is not able to catch every case
  • conclusion: GKE formula is probably the best use of the resource reservation feature
  • however, it will waste resources on nodes with few pods but high memory usage
    • generally, the configuration of reserved resources is a tradeoff between stability and resource utilization

Decision

  • we decided for approach 2: optional admission plugin (disabled by default for now)
    • sets kubeReserved per worker pool if not set (either by user or previous default)
    • systemReserved is not set, similar to GKE clusters (doesn't matter as long as it is not enforced)
  • we decided against offering the enforceNodeAllocatable field in the Shoot API, as the official documentation recommends not setting it unless you're absolutely sure what you're doing
    • the important thing is to enforce pods (which is the default) so that there is enough headroom for the system components
    • accordingly, configuring systemReserved doesn't make a difference -> will be removed from the shoot API
    • accordingly, aligning with the recommended cgroup setup doesn't make a difference -> will be dropped from #2590

The decisions described above will be reflected in updates to #2590.
Accordingly, we can close this issue as it was dedicated to discussing the usage of the GKE formula and the precise implementation approach.

/close

@timebertt: Closing this issue.
