aws / aws-network-policy-agent

BPF map entries not being removed

alemuro opened this issue · comments

What happened:

Hello, since I enabled AWS VPC CNI Network Policies, I've noticed that some nodes in my EKS cluster fail randomly. After some debugging, I saw that the aws-eks-nodeagent container is creating a lot of open_files/processes. After several hours this makes the node unresponsive, because services can no longer create files.
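One way to confirm this kind of growth on a node is to count the entries under /proc/<pid>/fd for the suspect process and compare the number over time. The following is only a minimal sketch, assuming a Linux node with procfs mounted; `count_fds` is a hypothetical helper name, not part of the agent or the CNI tooling, and the PID of the agent process has to be looked up separately.

```shell
#!/bin/sh
# count_fds PID: print the number of open file descriptors of a process.
# Hypothetical helper; relies on Linux procfs (/proc/<pid>/fd).
count_fds() {
    pid="$1"
    if [ ! -d "/proc/$pid/fd" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # Each entry in /proc/<pid>/fd corresponds to one open descriptor.
    ls "/proc/$pid/fd" | wc -l
}

# Example (descriptor count of the current shell):
#   count_fds "$$"
```

Running it every few minutes against the same PID shows whether the count keeps climbing. Reading another user's /proc/<pid>/fd usually requires root.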

Attach logs

I opened a shell in the aws-eks-nodeagent container and found the following logs. It looks like the container is unable to delete entries from the BPF map.

$ tail -f /var/log/aws-routed-eni/ebpf-sdk.log 
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
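To get a sense of how often this error occurs, the matching lines in the log can be counted. This is a rough sketch that only assumes the message text shown above; `count_delete_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# count_delete_errors FILE: count log lines that report a failed map-entry delete.
# Hypothetical helper; matches the message text seen in ebpf-sdk.log above.
count_delete_errors() {
    # grep -c prints the number of matching lines. It exits with status 1
    # when that number is 0, so the exit status is deliberately ignored.
    grep -c 'unable to delete map entry' "$1" || true
}

# Example (path taken from the tail command above):
#   count_delete_errors /var/log/aws-routed-eni/ebpf-sdk.log
```

Comparing the count between two points in time gives an approximate error rate.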

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

This is an "empty" cluster with one application and the following components:

  • Karpenter
  • EFS CSI driver

There are monitoring and ingress tools as well.

Environment:

  • Kubernetes version (use kubectl version): v1.27.7
  • CNI Version: v1.15.3
  • Network Policy Agent Version: v1.0.5
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.10.197-186.748.amzn2.x86_64

@alemuro - Can you please confirm whether enable-policy-event-logs is enabled or disabled? Do you still have the node? If so, can you collect the node logs via /opt/cni/bin/aws-cni-support.sh and the output of bpftool map show, and mail them to k8s-awscni-triage@amazon.com?

Never mind, we were able to reproduce the issue and have a possible fix. For now, the mitigation is to use v1.15.1, i.e. the release with agent version v1.0.4.

Hello @jayanthvn , I've sent the output of the aws-cni-support.sh script. Unfortunately, the instance is not live anymore so I cannot run the bpftool map show command.

I will try to downgrade to 1.15.1 and see if that fixes the issue. Thanks!