Pods with no eBPF maps attached
alemuro opened this issue · comments
What happened:
Sometimes, when starting new pods, they are not reachable by other pods. After some debugging I realised that:
- There is no eBPF map attached to these pods when executing `/opt/cni/bin/aws-eks-na-cli ebpf loaded-ebpfdata | grep Pod`, but they do have a mapping when everything works fine.
- There is a log entry with the text "Target Pod doesn't belong to the current pod Identifier". The command `grep "Target Pod doesn't belong to the current pod Identifier:" network-policy-agent.log | sed -e "s/.*Pod ID\: //" | awk -F "\"" '{print $3}' | sort -n | uniq` returns the list of all pods that are hosted on the current instance and are not reachable from other pods (because they don't have a map).
- If I open a shell in those affected pods, I can see that they can connect to the internet and have access to all IPs, even to an IP that should be filtered by a network policy attached to the namespace.
Our network policies are composed of:
- A generic NetworkPolicy that affects the whole namespace, which has the following rules:
  - deny all ingress traffic by default
  - allow all egress traffic going to the internet EXCEPT for a specific IP. <-- This is not filtered on the affected pods!
- A specific NetworkPolicy that is deployed with the application, which allows access from other services. <-- This is denied on the affected pods!
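For illustration, a minimal sketch of what the generic namespace-wide policy described above could look like. This is not the reporter's actual manifest; the policy name, namespace, and filtered CIDR are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress-limit-egress   # hypothetical name
  namespace: my-namespace                   # hypothetical namespace
spec:
  podSelector: {}          # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed -> all ingress denied by default
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0        # allow all egress to the internet...
            except:
              - 203.0.113.10/32    # ...EXCEPT this one IP (placeholder)
```

On the affected pods, the `except` block above is the part that is not being enforced.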
If we take a look at the `PolicyEndpoint` resources, they look fine. It seems like a problem between the controller and eBPF.
What you expected to happen:
- Ingress traffic should be allowed from the specified pods.
- Egress traffic should be filtered to the IPs listed in the `except` parameter.
- There should be an eBPF program attached to all pods that have Network Policies attached.
How to reproduce it (as minimally and precisely as possible):
It is random in our setup; we haven't figured out yet how to reproduce it.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): v1.27
- CNI Version: v1.16.0-eksbuild.1
- Network Policy Agent Version: v1.0.7
- OS (e.g. `cat /etc/os-release`): Amazon Linux v2
- Kernel (e.g. `uname -a`): 5.10.201-191.748.amzn2.x86_64
@alemuro is the problem persistent, i.e. the eBPF program never gets attached? We do have one known issue that was just fixed by #179. The short story is that if there are multiple replicas of the same pod on a node, there is a race condition where when one replica is deleted, the eBPF program for the other replica can also be deleted.
If this is a staging environment, you can try the v1.0.8-rc1 release candidate image that we just built. The official v1.0.8 image will be released in the coming weeks.
@alemuro is the problem persistent, i.e. the eBPF program never gets attached?
It is never attached. The only way of fixing it is by deleting the pod and letting Kubernetes create a new one.
Will try the v1.0.8-rc1 version, and I will give you some feedback!
Many thanks
Got it. If v1.0.8-rc1 does not resolve the issue, you can send an email with the network policy agent logs to k8s-awscni-triage@amazon.com, and we can dig further. Before sending the logs, enable network policy event logs (https://github.com/aws/aws-network-policy-agent?tab=readme-ov-file#enable-policy-event-logs) so the policy decisions can be logged as well.
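For reference, the event logs mentioned above can be enabled by passing configuration values to the managed VPC CNI addon, e.g. via `aws eks update-addon --addon-name vpc-cni --configuration-values`. A sketch of the values, with the key names taken from the linked README (verify them against your addon's configuration schema, as they are an assumption here):

```json
{
  "nodeAgent": {
    "enablePolicyEventLogs": "true"
  }
}
```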
Hello, we've been testing this for the whole day and it seems to be fixed.
v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3