kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Home Page: https://kubernetes.io

Watch request for CRs costs about 10-15x more memory in k8s-apiserver than in-tree resource watches

nagygergo opened this issue

What happened?

I was running some load testing related to Flux. When creating 10,000 Kustomization custom resources (each about 1 KiB), the kube-apiserver consumes about 1 GiB of memory. When checking with 100k and 300k resources, the kube-apiserver's memory consumption scales linearly.
When doing the same thing with 1 KiB ConfigMaps, creating 10,000 resources, the kube-apiserver consumes about 100 MiB of memory.
Memory pprof for 10k Kustomizations:
[heap profile image]

Memory pprof for 10k ConfigMaps:
[heap profile image]
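A heap profile like the ones above can be captured from the kube-apiserver's built-in pprof endpoint. This is a sketch rather than the exact procedure used in the report; it assumes profiling is enabled (the default) and the output file name is illustrative:

    kubectl get --raw /debug/pprof/heap > apiserver-heap.pprof
    go tool pprof -top apiserver-heap.pprof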

The memory usage stays at that level for as long as the resources exist. After looking into what might cause this, it seems that the kube-controller-manager sets up a watch for the kustomizations/configmaps resources.

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"a4187168-a301-4fd1-907a-09da8fc3b587","stage":"RequestReceived","requestURI":"/apis/kustomize.toolkit.fluxcd.io/v1/kustomizations?allowWatchBookmarks=true\u0026resourceVersion=727\u0026timeout=5m37s\u0026timeoutSeconds=337\u0026watch=true","verb":"watch","user":{"username":"system:kube-controller-manager","groups":["system:authenticated"]},"sourceIPs":["172.18.0.2"],"userAgent":"kube-controller-manager/v1.29.2 (linux/amd64) kubernetes/4b8e819/metadata-informers","objectRef":{"resource":"kustomizations","apiGroup":"kustomize.toolkit.fluxcd.io","apiVersion":"v1"},"requestReceivedTimestamp":"2024-04-29T14:33:32.979279Z","stageTimestamp":"2024-04-29T14:33:32.979279Z"}

This is needed because the garbage collector that runs in kube-controller-manager has to walk the ownership-reference graph, and it wants to do that from an in-memory cache:

// kube-controller-manager garbage collector (Sync): resyncMonitors re-syncs the graph builder's monitors,
// one metadata informer (and hence one apiserver watch) per deletable resource, including custom resources
if err := gc.resyncMonitors(logger, newResources); err != nil {

What did you expect to happen?

I would have expected similar memory usage for in-tree and custom resources.

Also, the current garbage collector seems to force the kube-apiserver to cache the full contents of etcd. Is that the intended implementation?

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a cluster
    kind create cluster

  2. Add the kustomize CRD
    curl -L https://raw.githubusercontent.com/fluxcd/kustomize-controller/main/config/crd/bases/kustomize.toolkit.fluxcd.io_kustomizations.yaml | kubectl apply -f -
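As an optional sanity check (not part of the original steps), confirm the CRD is installed before creating objects:

    kubectl get crd kustomizations.kustomize.toolkit.fluxcd.io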

  3. Create 10k copies of the following manifest, each with a unique metadata.name (see the sketch after the manifest)

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
spec:
  interval: 10m
  targetNamespace: default
  sourceRef:
    kind: GitRepository
    name: podinfo
  path: "./kustomize"
  prune: true
  timeout: 1m
  patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: not-used
      spec:
        template:
          metadata:
            annotations:
              cluster-autoscaler.kubernetes.io/safe-to-evict: "true"        
    target:
      kind: Deployment
      labelSelector: "app.kubernetes.io/part-of=my-app"
  - patch: |
      - op: add
        path: /spec/template/spec/securityContext
        value:
          runAsUser: 10000
          fsGroup: 1337
      - op: add
        path: /spec/template/spec/containers/0/securityContext
        value:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop:
              - ALL        
    target:
      kind: Deployment
      name: podinfo
      namespace: apps
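The 10k objects can be generated by giving each copy a unique metadata.name. A minimal sketch, assuming the manifest above is saved as kustomization-template.yaml and that bash, GNU sed, and kubectl are available (file names are illustrative):

    for i in $(seq 1 10000); do
      printf -- '---\n'
      # rewrite only metadata.name (two-space indent); the sourceRef name stays "podinfo"
      sed "s/^  name: podinfo$/  name: podinfo-${i}/" kustomization-template.yaml
    done > kustomizations.yaml
    kubectl apply -f kustomizations.yaml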

Anything else we need to know?

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

kind

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/sig apimachinery

@nagygergo: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachinery

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/sig api-machinery