gardener / gardener

Homogeneous Kubernetes clusters at scale on any infrastructure using hosted control planes.

Home Page: https://gardener.cloud

Consider using server-side filtering when the gardenlet watches shoot resources

istvanballok opened this issue

How to categorize this issue?

/area cost scalability
/kind enhancement

What would you like to be added:

Consider using server-side filtering when the gardenlet watches shoot resources.

The gardenlet in each seed appears to receive watch events for all shoots, only to discard the majority of them on the client side, namely those not relevant to its own seed. This causes unnecessary network traffic and additional processing in the gardenlet. Implementing server-side filtering could help mitigate these issues.

Why is this needed:

In larger Gardener installations, I've observed that the gardener-apiserver in the runtime-garden cluster generates significant network egress traffic. This traffic is dominated by cluster scope shoot watch responses and could have a cost and scalability impact.

Currently, each gardenlet requests the complete watch event stream for the shoots at cluster scope. In response, the gardener-apiserver transmits the full cluster scope shoot watch event stream to each gardenlet. As a result, the egress traffic of the gardener-apiservers scales linearly with the number of seeds. In larger installations with many seeds, this can add up to substantial egress volumes, which raises both cost and scalability concerns.

Implementing server-side filtering or sharding of watch events could alleviate this issue. Each gardenlet would only receive relevant events, eliminating the need to discard irrelevant ones on the client side. As a result, the total egress traffic of the gardener-apiserver would no longer be dependent on the number of seeds.

Considerations:

Server-side filtering when watching resources can be achieved using field selectors. For example:

k get shoots --field-selector=spec.seedName=aws-ha -w

However, there are some caveats. Currently, a seed seems to be interested in shoots where spec.seedName is the given seed, i.e. shoots that are scheduled to the given seed. Additionally, in the shoot migration scenario, shoots where status.seedName is the given seed are also relevant, if neither spec.seedName nor status.seedName is nil and they differ. Expressing this in a single HTTP request might be challenging as field selectors can only be combined with a logical AND and require literals on the right-hand side. Therefore, a second watch request and a dedicated field might be necessary to watch shoots currently being migrated away from the given seed.

I couldn't determine how the controller registration would need to be adjusted so that the resulting HTTP REST request includes the fieldSelector query parameter, hence I couldn't prepare a PR.
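
For illustration only, here is a minimal sketch of how such a field selector could be wired into a controller-runtime based component. It assumes the cache.Options.ByObject API of newer controller-runtime versions and that the gardener-apiserver accepts a spec.seedName field selector for shoots; the function name newGardenClusterManager is made up for this sketch and does not refer to existing gardenlet code.

package main

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newGardenClusterManager (hypothetical) creates a manager whose shared cache
// only watches shoots with spec.seedName == seedName. The field selector is
// sent to the gardener-apiserver as a fieldSelector query parameter, so the
// filtering happens on the server side instead of in the gardenlet.
func newGardenClusterManager(seedName string) (manager.Manager, error) {
	scheme := runtime.NewScheme()
	utilruntime.Must(gardencorev1beta1.AddToScheme(scheme))

	return manager.New(config.GetConfigOrDie(), manager.Options{
		Scheme: scheme,
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&gardencorev1beta1.Shoot{}: {
					Field: fields.OneTermEqualSelector("spec.seedName", seedName),
				},
			},
		},
	})
}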

While it may not be idiomatic to use server-side filtering with the controller runtime API, or there may even be other drawbacks, I believe it could be beneficial for cost and scalability reasons to address this linear scaling issue.

Detailed investigations

To reach the above conclusions, I conducted the following investigations.

The network activity of the gardener-apiservers can be monitored using the kubelet/cadvisor container network metrics. The following PromQL expression returns the egress network activity of the gardener-apiserver pods in MB/s, averaged over the past 24 hours. Values in the range of 100 MB/s can incur significant network costs and might approach scalability limits. The garden-prometheus in the runtime-garden cluster captures these metrics.

avg_over_time(
  (sum(rate(container_network_transmit_bytes_total{pod=~"gardener-apiserver-.*"}[5m])) / 1024 ^ 2)[1d:]
)

To further understand the network egress, I used the apiserver response size metrics. They can be used to compare the low-level network egress bandwidth above with the application-level response sizes, and to break it down by resource and scope. The following PromQL expressions show the application-level network egress in MB/s, averaged over the past 24 hours, and the percentage contributed by cluster scope shoot watch responses.

avg_over_time(
  (sum(rate(apiserver_response_sizes_sum{pod=~"gardener-apiserver-.+"}[5m])) / 1024 ^ 2)[1d:]
)

  avg_over_time(
    (
        sum(
          rate(
            apiserver_response_sizes_sum{pod=~"gardener-apiserver-.+",resource="shoots",scope="cluster",verb="WATCH"}[5m]
          )
        )
      /
        sum(rate(apiserver_response_sizes_sum{pod=~"gardener-apiserver-.+"}[5m]))
    )[1d:]
  )
*
  100

After checking these metrics in larger Gardener installations, I concluded that the network egress of the gardener-apiserver is significant and dominated by cluster scope shoot watch events. Although a single shoot resource is small (a few tens of kilobytes), due to the multiplication effect described above, the total network egress of the gardener-apiservers scales linearly with the number of seeds. Therefore, with many seeds, the network traffic can become significant.
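
To make the multiplication effect concrete, here is a purely illustrative back-of-envelope calculation (all numbers are assumptions, not measurements): with 1000 shoots, roughly 20 kB per shoot watch event, on average one update event per shoot per minute, and 40 seeds, the apiservers would emit about

1000 shoots × 20 kB/event × 1 event/min × 40 seeds ≈ 800 MB/min ≈ 13 MB/s

for cluster scope shoot watch events alone. With server-side filtering, each event would only be sent to the seed(s) it is relevant for, which in this example would reduce that traffic by roughly the factor of the seed count.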

To identify which component is actually watching the shoot resources, we can enable HTTP access logs for the virtual-garden-kube-apiserver pods by adding the --vmodule=httplog=3 command line flag. Then, we can process the access logs e.g. with the following Vali expression. This shows the user agents of the clients performing the shoot watch requests at cluster scope.

{container_name="kube-apiserver", pod_name=~"virtual-garden-kube-apiserver.*"}
|= "httplog.go"
|= "WATCH"
|= "v1beta1/shoots?"
| json
| line_format "{{regexReplaceAllLiteral \".*HTTP\\\" \" .log \"\"}}"
| logfmt
| line_format "{{.userAgent}} {{.verb}} {{.URI}}"

After analyzing the logs, I concluded that most of the cluster scope shoot watch requests originate from the gardenlets in the seeds, which aligns with the expectations of the Gardener architecture.

Note that the response size attribute is not part of the HTTP access logs. Therefore, to reach the conclusions above, one needs to combine metrics (to identify the top resource/scope combination) with logs (to find the user agent).

Audit logs could further be used to check the identity of the client. However, in this case, it was not necessary because the user agent was sufficiently unique for the gardenlet component.

We can enable HTTP access logs in the local setup of Gardener as well. In this setup, the KinD cluster’s Kubernetes API server acts as the virtual garden cluster, so the vmodule flag needs to be passed there.

--- a/example/gardener-local/kind/cluster/templates/_kubeadm_config_patches.tpl
+++ b/example/gardener-local/kind/cluster/templates/_kubeadm_config_patches.tpl
@@ -9,6 +9,7 @@
       - gardener-apiserver.relay.svc.cluster.local
 {{- end }}
     extraArgs:
+      vmodule: httplog=3
 {{- if not .Values.gardener.controlPlane.deployed }}
       authorization-mode: RBAC,Node
 {{- else }}

This approach can be used to verify that the controller runtime abstraction is properly configured to use server-side filtering.

$ k logs kube-apiserver-gardener-local-ha-multi-zone-control-plane \
  | grep 'verb="WATCH"' | grep "v1beta1/shoots?" | grep gardenlet \
  | sed -E 's/ +/\n/g'

I0125 10:12:19.054230 1 httplog.go:132]
"HTTP"
verb="WATCH"
URI="/apis/core.gardener.cloud/v1beta1/shoots?allowWatchBookmarks=true&resourceVersion=343&timeoutSeconds=408&watch=true"
latency="6m48.008009716s"
userAgent="gardenlet/v0.0.0 (linux/amd64) kubernetes/$Format"
audit-ID="e09a77bd-9a2c-4a6a-b278-c5ea542e4c57"
srcIP="172.18.0.4:59736"
apf_pl="global-default"
apf_fs="global-default"
apf_iseats=1
apf_fseats=0
apf_additionalLatency="0s"
apf_init_latency="1.165578ms"
apf_execution_time="1.167744ms"
resp=200

The URI shows that the gardenlet is currently watching the shoots at cluster scope without server-side filtering, as indicated by the absence of the fieldSelector query parameter.

/apis/core.gardener.cloud/v1beta1/shoots?
  allowWatchBookmarks=true&
  resourceVersion=343&
  timeoutSeconds=408&
  watch=true
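
For comparison, a server-side filtered watch would carry the selector in the request itself, for example (with a hypothetical seed name local; spec.seedName=local is URL-encoded):

/apis/core.gardener.cloud/v1beta1/shoots?
  allowWatchBookmarks=true&
  fieldSelector=spec.seedName%3Dlocal&
  resourceVersion=343&
  timeoutSeconds=408&
  watch=true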

cc @petersutter @vpnachev @AleksandarSavchev @rickardsjp @rfranzke

Currently, a seed seems to be interested in shoots where spec.seedName is the given seed, i.e. shoots that are scheduled to the given seed. Additionally, in the shoot migration scenario, shoots where status.seedName is the given seed are also relevant, if neither spec.seedName nor status.seedName is nil and they differ. Expressing this in a single HTTP request might be challenging as field selectors can only be combined with a logical AND and require literals on the right-hand side.

Right, we would need a request with a logical OR, and this is something which is not supported by the Kubernetes API servers.

Therefore, a second watch request and a dedicated field might be necessary to watch shoots currently being migrated away from the given seed.

This is something which is not supported by the controller-runtime. We could perhaps write a custom cache.Cache implementation to start two watches, one with a field selector for .spec.seedName, and one for .status.seedName, and merge both results together into a single cache, but that's probably not straight-forward and comes with its own complexity. Alternatively, we could maybe use a separate cache with .status.seedName just for the migration scenario, but this would also increase the complexity in the respective controller code.
Generally, having two filtered watches would definitely be better than what happens today, but we'd still end up with two watches instead of only one.
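
Purely as a sketch of that idea (assuming the cache.Options.ByObject API of newer controller-runtime versions, field selector support for both fields in the gardener-apiserver, and a made-up helper name), the two filtered watches could be set up roughly like this; merging the two caches into a single view is where the complexity mentioned above would live:

package main

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newSeedScopedShootCaches (hypothetical) builds one cache filtered on
// spec.seedName and a second one filtered on status.seedName (for shoots being
// migrated away from the seed). Two caches are needed because field selectors
// cannot express a logical OR and a cache holds only one selector per object type.
func newSeedScopedShootCaches(cfg *rest.Config, seedName string) (specCache, statusCache cache.Cache, err error) {
	scheme := runtime.NewScheme()
	utilruntime.Must(gardencorev1beta1.AddToScheme(scheme))

	newShootCache := func(field string) (cache.Cache, error) {
		return cache.New(cfg, cache.Options{
			Scheme: scheme,
			ByObject: map[client.Object]cache.ByObject{
				&gardencorev1beta1.Shoot{}: {
					Field: fields.OneTermEqualSelector(field, seedName),
				},
			},
		})
	}

	if specCache, err = newShootCache("spec.seedName"); err != nil {
		return nil, nil, err
	}
	if statusCache, err = newShootCache("status.seedName"); err != nil {
		return nil, nil, err
	}
	return specCache, statusCache, nil
}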

In a sync with @timebertt, we found that the easiest way is to use label selectors maintained by the gardener-apiserver. For the seeds in .spec.seedName and .status.seedName, it could add seed.gardener.cloud/<name>=true labels to Shoots (and other resources like BackupEntrys as well, btw). This would allow gardenlets (and other clients) to start their watches with such label selector and effectively enable the server-side filtering.
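
For illustration, a minimal sketch of what the consumer side could look like with such labels, assuming seed.gardener.cloud/<name>=true labels maintained by the gardener-apiserver as described above, the cache.Options.ByObject API of newer controller-runtime versions, and a made-up helper name:

package main

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// seedScopedCacheOptions (hypothetical) restricts the shoot cache to objects
// labeled for the given seed. A single label selector covers both the
// spec.seedName and status.seedName cases, because the gardener-apiserver would
// maintain the label for either field, so one filtered watch per seed suffices.
func seedScopedCacheOptions(seedName string) cache.Options {
	return cache.Options{
		ByObject: map[client.Object]cache.ByObject{
			&gardencorev1beta1.Shoot{}: {
				Label: labels.SelectorFromSet(labels.Set{
					"seed.gardener.cloud/" + seedName: "true",
				}),
			},
		},
	}
}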

I'll take a look and work on it.
/assign