Watch request for CRs costs about 10-15x more memory in k8s-apiserver than in-tree resource watches

nagygergo opened this issue

What happened?

I was running some load tests related to Flux. When creating 10,000 Kustomization custom resources (about 1 KiB each), the k8s apiserver consumes about 1 GiB of memory. When checking with 100,000 and 300,000 resources, the memory usage of the k8s apiserver scales linearly.
When doing the same thing with 10,000 ConfigMaps of about 1 KiB each, the k8s apiserver consumes about 100 MiB of memory.
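To put the two measurements side by side, the per-object cost can be worked out from the figures above. This is only back-of-the-envelope arithmetic on the reported numbers (about 1 GiB and about 100 MiB for 10,000 objects each). It assumes all of the reported memory is attributable to the objects, which ignores the apiserver's baseline footprint.

```python
# Back-of-the-envelope per-object memory cost from the reported figures.
# Assumption: all reported memory is attributed to the created objects.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

OBJECT_COUNT = 10_000


def per_object_kib(total_bytes: int, count: int = OBJECT_COUNT) -> float:
    """Average apiserver memory per stored object, in KiB."""
    return total_bytes / count / KIB


kustomization_kib = per_object_kib(1 * GIB)  # custom resource
configmap_kib = per_object_kib(100 * MIB)    # in-tree resource

print(f"Kustomization: ~{kustomization_kib:.1f} KiB per object")
print(f"ConfigMap:     ~{configmap_kib:.1f} KiB per object")
print(f"Ratio:         ~{kustomization_kib / configmap_kib:.1f}x")
```

With these figures, each Kustomization of about 1 KiB costs roughly 105 KiB of apiserver memory, against roughly 10 KiB for a ConfigMap of the same size.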
Memory pprof for 10k kustomizations: [image: 10k kustomizations]

Memory pprof for 10k configmaps:

The memory usage stays the same for as long as the resources exist. After looking into what might cause this, it seems that kube-controller-manager sets up a watch for the kustomizations and configmaps resources.

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"a4187168-a301-4fd1-907a-09da8fc3b587","stage":"RequestReceived","requestURI":"/apis/kustomize.toolkit.fluxcd.io/v1/kustomizations?allowWatchBookmarks=true\u0026resourceVersion=727\u0026timeout=5m37s\u0026timeoutSeconds=337\u0026watch=true","verb":"watch","user":{"username":"system:kube-controller-manager","groups":["system:authenticated"]},"sourceIPs":[""],"userAgent":"kube-controller-manager/v1.29.2 (linux/amd64) kubernetes/4b8e819/metadata-informers","objectRef":{"resource":"kustomizations","apiGroup":"kustomize.toolkit.fluxcd.io","apiVersion":"v1"},"requestReceivedTimestamp":"2024-04-29T14:33:32.979279Z","stageTimestamp":"2024-04-29T14:33:32.979279Z"}
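The relevant fields of the audit event above are easier to read once the request URI is decoded. The snippet below is a small sketch that does this with the Python standard library. The event string is a trimmed copy of the one above, and `summarize_watch` is a hypothetical helper name, not part of any Kubernetes tooling.

```python
import json
from urllib.parse import parse_qs, urlsplit

# Trimmed copy of the audit event above; only the fields used below are kept.
AUDIT_EVENT = r'''{"kind":"Event","verb":"watch","requestURI":"/apis/kustomize.toolkit.fluxcd.io/v1/kustomizations?allowWatchBookmarks=true\u0026resourceVersion=727\u0026timeout=5m37s\u0026timeoutSeconds=337\u0026watch=true","user":{"username":"system:kube-controller-manager"},"userAgent":"kube-controller-manager/v1.29.2 (linux/amd64) kubernetes/4b8e819/metadata-informers","objectRef":{"resource":"kustomizations","apiGroup":"kustomize.toolkit.fluxcd.io","apiVersion":"v1"}}'''


def summarize_watch(raw_event: str) -> dict:
    """Extract who is watching what from a single audit event (hypothetical helper)."""
    event = json.loads(raw_event)  # json.loads turns \u0026 back into "&"
    query = {k: v[0] for k, v in parse_qs(urlsplit(event["requestURI"]).query).items()}
    ref = event["objectRef"]
    return {
        "user": event["user"]["username"],
        "verb": event["verb"],
        "resource": f'{ref["resource"]}.{ref["apiGroup"]}/{ref["apiVersion"]}',
        "is_watch": query.get("watch") == "true",
        "resource_version": query.get("resourceVersion"),
        "timeout_seconds": int(query["timeoutSeconds"]),
        # Last path segment of the user agent string.
        "client": event["userAgent"].rsplit("/", 1)[-1],
    }


for key, value in summarize_watch(AUDIT_EVENT).items():
    print(f"{key}: {value}")
```

The decoded event shows a cluster-wide watch on kustomizations issued by `system:kube-controller-manager`, with a user agent that ends in `metadata-informers`.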

This is needed because the garbage collector that runs in kube-controller-manager needs to walk the owner reference graph, and it wants to do that from its cache:

if err := gc.resyncMonitors(logger, newResources); err != nil {
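Why a garbage collector would want every object's metadata in a local cache can be illustrated with a toy model: deciding whether an object is garbage means checking whether any of its owners still exists, and that check has to cover objects of every kind. The sketch below is not the kube-controller-manager code. The `Obj` type and `find_garbage` function are hypothetical, and only the idea of owner references keyed by UID is taken from Kubernetes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical stand-in for object metadata: a UID plus the owners' UIDs."""
    uid: str
    owner_uids: tuple = ()


def find_garbage(cache):
    """Return UIDs of objects that have owners, none of which still exist.

    `cache` maps UID -> Obj and plays the role of the local cache.
    The loop repeats until nothing changes, so dependents of collected
    objects are collected as well (a cascading delete).
    """
    live = set(cache)
    changed = True
    while changed:
        changed = False
        for uid in sorted(live):
            owners = cache[uid].owner_uids
            if owners and not any(owner in live for owner in owners):
                live.discard(uid)
                changed = True
    return set(cache) - live


cache = {
    "deploy": Obj("deploy"),
    "rs": Obj("rs", ("deploy",)),
    "pod": Obj("pod", ("rs",)),
    "orphan": Obj("orphan", ("deleted-owner",)),  # owner no longer exists
    "child": Obj("child", ("orphan",)),
}
print(sorted(find_garbage(cache)))  # prints ['child', 'orphan']
```

In this model the walk only needs UIDs and owner references, not full object bodies, but it does need an entry for every object of every watched resource.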

What did you expect to happen?

I expected memory usage to be similar for in-tree and custom resources.

Also, the current garbage collector seems to force k8s-apiserver to cache the full contents of etcd. Is that a correct implementation?

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a cluster
    kind create cluster

  2. Add the kustomize CRD
    curl -L https://raw.githubusercontent.com/fluxcd/kustomize-controller/main/config/crd/bases/kustomize.toolkit.fluxcd.io_kustomizations.yaml | kubectl apply -f -

  3. Create 10k of the following Kustomization, each with a unique name

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
spec:
  interval: 10m
  targetNamespace: default
  sourceRef:
    kind: GitRepository
    name: podinfo
  path: "./kustomize"
  prune: true
  timeout: 1m
  patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: not-used
      spec:
        template:
          metadata:
            annotations:
              cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    target:
      kind: Deployment
      labelSelector: "app.kubernetes.io/part-of=my-app"
  - patch: |
      - op: add
        path: /spec/template/spec/securityContext
        value:
          runAsUser: 10000
          fsGroup: 1337
      - op: add
        path: /spec/template/spec/containers/0/securityContext
        value:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop:
              - ALL
    target:
      kind: Deployment
      name: podinfo
      namespace: apps
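One possible way to carry out step 3 is to render the manifest 10k times into a single multi-document YAML file. The sketch below assumes the usual Kustomization nesting (`metadata`, `spec`, `sourceRef`, `patches`, `target`) for the manifest text. The `podinfo-<n>` naming scheme, the function names and the output path are hypothetical and not taken from the original report.

```python
# Manifest from step 3 with a placeholder for the object name.
# Assumption: standard Kustomization nesting (metadata/spec/sourceRef/patches/target).
MANIFEST_TEMPLATE = """\
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: __NAME__
spec:
  interval: 10m
  targetNamespace: default
  sourceRef:
    kind: GitRepository
    name: podinfo
  path: "./kustomize"
  prune: true
  timeout: 1m
  patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: not-used
      spec:
        template:
          metadata:
            annotations:
              cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    target:
      kind: Deployment
      labelSelector: "app.kubernetes.io/part-of=my-app"
  - patch: |
      - op: add
        path: /spec/template/spec/securityContext
        value:
          runAsUser: 10000
          fsGroup: 1337
      - op: add
        path: /spec/template/spec/containers/0/securityContext
        value:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop:
              - ALL
    target:
      kind: Deployment
      name: podinfo
      namespace: apps
"""


def render_manifests(count: int, prefix: str = "podinfo") -> str:
    """Return `count` copies of the manifest, uniquely named, separated by '---'."""
    docs = [MANIFEST_TEMPLATE.replace("__NAME__", f"{prefix}-{i}") for i in range(count)]
    return "---\n".join(docs)


def write_manifests(path: str, count: int = 10_000) -> None:
    """Write the rendered manifests to `path` (hypothetical output file)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_manifests(count))


# Small preview only; call write_manifests("kustomizations.yaml") for the full set.
print(render_manifests(2).count("kind: Kustomization"), "documents rendered")
```

After `write_manifests("kustomizations.yaml")`, the file can be applied with `kubectl apply -f kustomizations.yaml`. Since the manifest sets no `metadata.namespace`, the objects land in whatever namespace kubectl is currently configured to use.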

Anything else we need to know?

Kubernetes version

$ kubectl version

OS version

# On Linux:
$ cat /etc/os-release
$ uname -a
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

/sig api-machinery