BPF map entries not being removed
alemuro opened this issue
What happened:
Hello, since I enabled AWS VPC CNI network policies, I've noticed that some nodes in my EKS cluster fail randomly. After some debugging, I saw that the aws-eks-nodeagent container is opening a large number of files and spawning many processes. After a few hours this makes the node unresponsive, because services can no longer open files.
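As an illustrative way to confirm the file-descriptor growth (this check is my own sketch, not something from the agent's tooling), you can count the entries under `/proc/<pid>/fd` for the node agent process; the snippet falls back to the current shell's PID so it runs anywhere:

```shell
# Illustrative check: count open file descriptors for the node agent.
# "aws-eks-nodeagent" is the process name on the node; if it is not
# running, fall back to the current shell's PID ($$) so the snippet
# still executes.
pid=$(pidof aws-eks-nodeagent 2>/dev/null | awk '{print $1}')
pid=${pid:-$$}
count=$(ls "/proc/$pid/fd" | wc -l)
echo "PID $pid has $count open file descriptors"
```

Running this periodically and watching the count climb is a quick way to tell whether the leak is file descriptors rather than, say, memory.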
Attach logs
I exec'd into the aws-eks-nodeagent container and saw the following logs. Apparently the container is unable to delete entries from the BPF conntrack map.
$ tail -f /var/log/aws-routed-eni/ebpf-sdk.log
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
{"level":"error","ts":"2023-11-07T14:09:31.711Z","caller":"conntrack/conntrack_client.go:131","msg":"unable to delete map entry and ret -1 and err no such file or directory"}
... (the same error line repeats continuously)
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
This is an "empty" cluster running one application and the following components:
- Karpenter
- EFS CSI driver
There are monitoring and ingress tools as well.
Environment:
- Kubernetes version (use `kubectl version`): v1.27.7
- CNI Version: v1.15.3
- Network Policy Agent Version: v1.0.5
- OS (e.g: `cat /etc/os-release`): Amazon Linux 2
- Kernel (e.g. `uname -a`): 5.10.197-186.748.amzn2.x86_64
@alemuro - Can you please confirm whether enable-policy-event-logs is enabled or disabled? Do you still have the node? If so, can you collect node logs via `/opt/cni/bin/aws-cni-support.sh` and the output of `bpftool map show`, and mail them to k8s-awscni-triage@amazon.com?
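For anyone hitting this later, the collection step above amounts to running two commands on the affected node (this is just a sketch of the maintainer's request; both need root, and `bpftool` may need to be installed separately):

```shell
# Bundle CNI and node logs for the AWS triage team
# (script path as given in the comment above).
sudo /opt/cni/bin/aws-cni-support.sh

# List the currently loaded BPF maps (id, type, name, sizes).
sudo bpftool map show
```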
Never mind, we were able to reproduce the issue and have a possible fix. For now, the mitigation is to use CNI v1.15.1, i.e. agent version v1.0.4.
Hello @jayanthvn, I've sent the output of the `aws-cni-support.sh` script. Unfortunately, the instance is no longer live, so I cannot run the `bpftool map show` command.
I will try to downgrade to 1.15.1 and see if that fixes the issue. Thanks!