argoproj-labs / argocd-operator

A Kubernetes operator for managing Argo CD clusters.

Home Page: https://argocd-operator.readthedocs.io

High memory consumption due to ConfigMap watches

jotak opened this issue

Describe the bug
Before anything, note that I am not an Argo CD user: I'm a developer of another OLM-based operator and, while investigating memory issues, I wanted out of curiosity to test a bunch of other operators to see who else was affected by the same issue, and it seems the Argo CD operator is. I haven't done a deep investigation of argocd-operator in particular, so if you think this is a false positive, I apologize for the inconvenience and you can close this issue.

The problem: I ran a simple test: I installed a bunch of operators, monitored their memory consumption, then created a dummy namespace and many ConfigMaps in that namespace. On some operators the memory consumption remained stable; on others, like this one, it increased linearly with the number of ConfigMaps created.

My assumption is that there is little chance your operator actually needs to watch every ConfigMap (is that correct?). This is a quite common problem that has been documented here: https://sdk.operatorframework.io/docs/best-practices/designing-lean-operators/#overview :

"One of the pitfalls that many operators are failing into is that they watch resources with high cardinality like secrets possibly in all namespaces. This has a massive impact on the memory used by the controller on big clusters."

From my experience, with some customers this can amount to gigabytes of overhead. And I would add that it's not only about memory usage: it also stresses the Kube API with a lot of traffic.

The article above suggests a remediation using cache configuration: if that solves the problem for you, great! In case it doesn't, you might want to chime in here: kubernetes-sigs/controller-runtime#2570 . I'm proposing to add more cache-management options to controller-runtime, but for that I would like to probe the different use cases among OLM users, in order to understand whether the solution I'm suggesting would help others or not. I guess the goal is to find a solution that suits most of us, rather than each of us implementing our own custom cache management.
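To make that concrete, here is roughly what the cache-level remediation from that article looks like with recent controller-runtime versions (>= v0.15). This is only a sketch; the label used below is an example, not something this operator necessarily sets:

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Illustrative label: for this filter to work, the operator would have to
	// put this label on every ConfigMap/Secret it actually needs to see.
	managed := labels.SelectorFromSet(labels.Set{"app.kubernetes.io/managed-by": "argocd-operator"})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Cache (and therefore list/watch) only labelled ConfigMaps and
			// Secrets, instead of every such object in the cluster.
			ByObject: map[client.Object]cache.ByObject{
				&corev1.ConfigMap{}: {Label: managed},
				&corev1.Secret{}:    {Label: managed},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered here before calling mgr.Start(ctrl.SetupSignalHandler()).
	_ = mgr
}
```

With a configuration like this, the shared informers only receive (and keep in memory) the objects matching the selector, which is what keeps memory flat when unrelated ConfigMaps are created.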

To Reproduce

  • Install the operator
  • Watch memory used
  • kubectl create namespace test
  • for i in {1..500}; do kubectl create cm test-cm-$i -n test --from-file=<INSERT BIG FILE HERE> ; done

Expected behavior

The memory should remain fairly stable, but it increases with every ConfigMap created.

Screenshots

Screenshot from 2023-10-27, 09:03:34

@jotak thanks for reporting it. Will evaluate it.

@jotak I couldn't reproduce this issue. I followed the steps by running the operator locally against an OpenShift cluster and created close to 500 ConfigMaps in a test namespace. The memory increased slightly at first (though negligibly) but soon became stable. Just to be sure, I also tried repeating the steps in an argocd namespace where the instance is running, but got the same results. I wonder if there was something else that caused the memory to increase linearly in your case. Can you please share the operator and K8s versions that were used for testing?

I investigated the code to check how the operator watches ConfigMaps. AFAIK it watches them in two cases:

  1. The ConfigMap is owned by the operand, for example the Argo CD workload ConfigMaps.
  2. ConfigMaps that are used for AppSet GitLab TLS, identified by a specific name.

Except for the above cases, the operator shouldn't watch ConfigMaps in all namespaces. @chengfang @jaideepr97 please feel free to correct me.
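For reference, here is a minimal sketch of what watching ConfigMaps only in those two cases can look like with controller-runtime (>= v0.15). The reconciler type, the mapping function, the API import path, and the ConfigMap name below are illustrative, not the operator's actual code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	argoproj "github.com/argoproj-labs/argocd-operator/api/v1beta1" // assumed API package
)

// ExampleReconciler is a stand-in for the real ArgoCD reconciler.
type ExampleReconciler struct {
	client.Client
}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Reconciliation logic omitted.
	return ctrl.Result{}, nil
}

// mapConfigMapToArgoCD would translate an event on the well-known ConfigMap
// into requests for the ArgoCD instance(s) that depend on it; omitted here.
func (r *ExampleReconciler) mapConfigMapToArgoCD(ctx context.Context, obj client.Object) []reconcile.Request {
	return nil
}

func (r *ExampleReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Case 2: only react to a single, well-known ConfigMap (illustrative name).
	byName := predicate.NewPredicateFuncs(func(obj client.Object) bool {
		return obj.GetName() == "appset-gitlab-scm-tls-certs-cm"
	})

	return ctrl.NewControllerManagedBy(mgr).
		For(&argoproj.ArgoCD{}).
		// Case 1: ConfigMaps owned by the operand (e.g. the Argo CD workload ConfigMaps).
		Owns(&corev1.ConfigMap{}).
		// Case 2: the named ConfigMap, filtered so unrelated ConfigMaps don't trigger reconciles.
		Watches(&corev1.ConfigMap{},
			handler.EnqueueRequestsFromMapFunc(r.mapConfigMapToArgoCD),
			builder.WithPredicates(byName)).
		Complete(r)
}
```

One caveat worth keeping in mind: predicates like byName only filter events, so even with this setup the underlying informer still lists and caches every ConfigMap in the watched namespaces. Reducing memory requires restricting the cache itself, as in the cache options from the linked article.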

@chetan-rns I think this commit has fixed the issue: c8e4909#diff-2873f79a86c0d8b3335cd7731b0ecf7dd4301eb19a82ef7a1cba7589b5252261R264-R284
It was probably not there when I first tried.

PS: I confirm that I can no longer reproduce the issue today.
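For completeness, and for anyone landing on this issue later: a cache-level restriction of that general kind (this is not a copy of the linked commit) can look like the sketch below with controller-runtime >= v0.15; the ConfigMap name is illustrative:

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				// Only list/watch the single well-known ConfigMap by name
				// (illustrative), not every ConfigMap in the cluster.
				&corev1.ConfigMap{}: {
					Field: fields.OneTermEqualSelector("metadata.name", "appset-gitlab-scm-tls-certs-cm"),
				},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}
	_ = mgr // controllers registered here before mgr.Start(...)
}
```

Note that a name filter this narrow would also hide the operand-owned ConfigMaps from the cache, so a real fix has to make the selector broad enough (for example via a common label) to cover everything the operator still needs to read through the cached client.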