aws / aws-network-policy-agent

Container keeps running OOM

GillesACAGroup opened this issue · comments

What happened:

I installed the VPC CNI Plugin v1.15.3-eksbuild.1 via AWS EKS onto a 1.28 EKS cluster. We configured the plugin with the following configuration values:

{"enableNetworkPolicy":"true","resources":{"limits":{"memory":"96Mi"},"requests":{"memory":"96Mi"}}}
We quickly noticed that the aws-node pods were being OOM-killed (exit code 137). The cluster has 6 nodes, so we figured the container needed more memory, but after increasing the limit to 128Mi the issue persisted. Eventually we increased it to 1Gi. The container then stopped crashing, but aws-eks-nodeagent started emitting a large volume of logs once the container reached 500Mi of memory usage:
[Screenshot: SCR-20240216-nqmf]
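
As background on the exit code mentioned above: by shell convention, an exit code above 128 means the process was ended by signal number (code - 128), so 137 corresponds to signal 9, SIGKILL, which is what the kernel OOM killer sends. A small sketch of that arithmetic follows; `describe_exit_code` is a hypothetical name, and the signal lookup assumes a Linux or other POSIX host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a container exit code to the signal that ended the process, if any."""
    if code > 128:
        try:
            # Convention: exit code = 128 + signal number.
            sig = signal.Signals(code - 128)
        except ValueError:
            return f"exit code {code} does not map to a known signal"
        return f"terminated by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```
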

$ kubectl logs aws-node-jmltf -n kube-system -c aws-eks-nodeagent
{"level":"info","ts":"2024-02-13T09:38:31.036Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
2024-02-13 09:38:31.13038271 +0000 UTC Logger.check error: failed to get caller
E0213 12:59:26.839051 1 token_source.go:185] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open files
E0213 13:05:50.842176 1 token_source.go:185] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open files
E0213 13:15:04.845861 1 token_source.go:185] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open files
E0213 13:23:43.848835 1 token_source.go:185] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open files
E0213 13:33:23.851882 1 token_source.go:185] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open files
E0213 13:40:54.855348 1 token_source.go:185] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: too many open files
....
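
To get a quick picture of how often each error appears in output like the above, a rough sketch such as the following can tally the error lines by their trailing reason. The regular expression is an assumption based only on the line shape visible in this log (severity letter, MMDD date, time, PID, source location, message), and `tally_errors` is a hypothetical helper. It expects one log entry per line.

```python
import re
from collections import Counter

# Matches lines shaped like:
#   E0213 12:59:26.839051 1 token_source.go:185] Unable to rotate token: ...
# The pattern is inferred from the log excerpt in this issue (assumption).
LOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<pid>\d+) "
    r"(?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def tally_errors(log_text: str) -> Counter:
    """Count error-severity lines, keyed by the text after the last ': '."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("sev") == "E":
            # The last ': '-separated segment is the underlying reason,
            # e.g. "too many open files".
            reason = match.group("msg").rsplit(": ", 1)[-1]
            counts[reason] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): kubectl logs <pod> -n kube-system -c aws-eks-nodeagent | python tally.py
    for reason, count in tally_errors(sys.stdin.read()).most_common():
        print(f"{count:6d}  {reason}")
```
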

Logs from the other containers:
$ kubectl logs aws-node-jmltf -n kube-system -c aws-node
time="2024-02-13T09:38:30Z" level=info msg="Starting IPAM daemon... "
Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-02-13T09:38:30Z" level=info msg="Checking for IPAM connectivity... "
time="2024-02-13T09:38:32Z" level=info msg="Copying config file... "
time="2024-02-13T09:38:32Z" level=info msg="Successfully copied CNI plugin binary and config file."

$ kubectl logs aws-node-jmltf -n kube-system -c aws-vpc-cni-init
time="2024-02-13T09:38:30Z" level=info msg="Copying CNI plugin binaries ..."
time="2024-02-13T09:38:30Z" level=info msg="Copied all CNI plugin binaries to /host/opt/cni/bin"
time="2024-02-13T09:38:30Z" level=info msg="Found primaryMAC 02:b3:1f:ba:b5:11"
time="2024-02-13T09:38:30Z" level=info msg="Found primaryIF eth0"
time="2024-02-13T09:38:30Z" level=info msg="Updated net/ipv4/conf/eth0/rp_filter to 2\n"
time="2024-02-13T09:38:30Z" level=info msg="Updated net/ipv4/tcp_early_demux to 1\n"
time="2024-02-13T09:38:30Z" level=info msg="CNI init container done"

Anything else we need to know?:
We run clusters for other customers with the exact same plugin versions and worker-node AMI, and those all run fine with 96Mi. This is the only cluster where we are seeing this issue.

Environment:

  • Kubernetes version (use kubectl version): 1.28
  • CNI Version: v1.15.3
  • Network Policy Agent Version: v1.0.5
  • OS (e.g: cat /etc/os-release): Bottlerocket OS 1.16.0 (aws-k8s-1.28)
  • Kernel (e.g. uname -a): Linux 6.1.55

@GillesACAGroup This is a known issue with v1.15.3 that results in the "too many open files" error you're seeing above. Please upgrade to the latest version of VPC CNI.

The issue was indeed fixed by upgrading to v1.15.4. Memory usage is back to normal and the error logs are gone.
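
For anyone checking several clusters, here is a minimal sketch that compares an installed add-on version tag against the v1.15.4 release that resolved it here. `parse_version` and `older_than_fix` are hypothetical helpers. This thread only confirms that v1.15.3 was affected and v1.15.4 was not, so treating every older tag as a candidate for upgrade is an assumption.

```python
import re

# Version that resolved the issue in this thread.
FIXED_VERSION = (1, 15, 4)


def parse_version(tag: str) -> tuple:
    """Extract (major, minor, patch) from tags like 'v1.15.3-eksbuild.1'."""
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def older_than_fix(tag: str) -> bool:
    """Return True if the tag is older than FIXED_VERSION.

    Assumption: "older than the fix" is used as a proxy for "worth upgrading";
    the thread does not say which earlier versions are affected.
    """
    return parse_version(tag) < FIXED_VERSION


if __name__ == "__main__":
    for tag in ("v1.15.3-eksbuild.1", "v1.15.4"):
        print(tag, "-> upgrade suggested" if older_than_fix(tag) else "-> ok")
```
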