aws / aws-network-policy-agent

Pods with no eBPF maps attached

alemuro opened this issue

What happened:

Sometimes, newly started pods are not reachable by other pods. After some debugging I realised that:

  • There is no eBPF map attached to these pods when executing /opt/cni/bin/aws-eks-na-cli ebpf loaded-ebpfdata | grep Pod, whereas pods that work fine do have one.
  • There is a log entry with the text "Target Pod doesn't belong to the current pod Identifier". The command grep "Target Pod doesn't belong to the current pod Identifier:" network-policy-agent.log | sed -e "s/.*Pod ID\: //" | awk -F "\"" '{print $3}' | sort -n | uniq returns the list of all pods that are hosted on the current instance and are not reachable from other pods (because they have no map). Both commands are collected in the block after this list.
  • If I open a shell in one of the affected pods, it can connect to the internet and reach all IPs, even an IP that should be filtered by a network policy attached to the namespace.
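For reference, here are the two debugging commands from the list above in one place, to be run on the affected node (the agent log path may differ depending on how logs are collected):

```bash
# List the pods that currently have eBPF data attached on this node
/opt/cni/bin/aws-eks-na-cli ebpf loaded-ebpfdata | grep Pod

# Extract, from the agent log, the pods flagged with
# "Target Pod doesn't belong to the current pod Identifier"
# (these are the pods with no eBPF map attached)
grep "Target Pod doesn't belong to the current pod Identifier:" network-policy-agent.log \
  | sed -e "s/.*Pod ID\: //" \
  | awk -F "\"" '{print $3}' \
  | sort -n | uniq
```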

Our network policies are composed of:

  • A generic NetworkPolicy that affects the whole namespace, which has the following rules:
    • deny all ingress traffic by default
    • allow all egress traffic going to the internet EXCEPT for a specific IP. <-- This is not filtered on the affected pods!
  • A specific NetworkPolicy that is deployed with the application, which allows access from other services. <-- This is denied on the affected pods! (A hypothetical sketch of both policies follows this list.)
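A minimal sketch of what this pair of policies could look like; the namespace app-ns, the labels, and the CIDR 203.0.113.10/32 are placeholders rather than our real values:

```bash
# Hypothetical reconstruction of the two policies described above;
# namespace, labels, and IPs are placeholders.
kubectl apply -f - <<'EOF'
# Generic namespace-wide policy: deny all ingress, allow all egress except one IP
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress-restrict-egress
  namespace: app-ns
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress                # no ingress rules -> all ingress denied
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 203.0.113.10/32   # the IP that should be blocked
---
# Application-specific policy: allow ingress from peer services
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-peers
  namespace: app-ns
spec:
  podSelector:
    matchLabels:
      app: my-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: peer
EOF
```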

If we take a look at the PolicyEndpoint resources, they look fine. It seems to be a problem between the controller and eBPF.
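To double-check the controller side, the PolicyEndpoint custom resources can be dumped and compared against the loaded eBPF data (app-ns is a placeholder namespace):

```bash
# Inspect the PolicyEndpoint resources generated by the network policy controller
kubectl get policyendpoints -n app-ns
kubectl get policyendpoints -n app-ns -o yaml
```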

Attach logs

What you expected to happen:

  • Ingress traffic should be allowed from the specified pods.
  • Egress traffic to the IPs listed under the except parameter should be filtered.
  • There should be an eBPF program attached to every pod that has a NetworkPolicy applied.

How to reproduce it (as minimally and precisely as possible):

It is random in our setup; we haven't yet figured out how to reproduce it.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.27
  • CNI Version: v1.16.0-eksbuild.1
  • Network Policy Agent Version: v1.0.7
  • OS (e.g: cat /etc/os-release): Amazon Linux v2
  • Kernel (e.g. uname -a): 5.10.201-191.748.amzn2.x86_64

@alemuro is the problem persistent, i.e. does the eBPF program never get attached? We do have one known issue that was just fixed by #179. The short story is that if there are multiple replicas of the same pod on a node, there is a race condition in which deleting one replica can also delete the eBPF program for the other replica.

If this is a staging environment, you can try the v1.0.8-rc1 release candidate image that we just built. The official v1.0.8 image will be released in the coming weeks.
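If it helps, one way to test a specific agent image is to swap it into the node agent container, assuming the agent runs as the aws-eks-nodeagent container of the aws-node DaemonSet (container name and image URI are assumptions; adjust for your cluster and region):

```bash
# Point the node agent container at the release-candidate image
# (registry placeholder; use your regional ECR URI)
kubectl -n kube-system set image daemonset/aws-node \
  aws-eks-nodeagent=<your-regional-ecr>/amazon/aws-network-policy-agent:v1.0.8-rc1

# Watch the rollout across nodes
kubectl -n kube-system rollout status daemonset/aws-node
```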

@alemuro is the problem persistent, i.e. the eBPF program never gets attached?

It is never attached. The only way to fix it is to delete the pod and let K8S create a new one.
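In other words, the workaround is just deleting the affected pod so its controller recreates it (pod name and namespace are placeholders):

```bash
# Delete the affected pod; its Deployment/ReplicaSet will recreate it,
# and the replacement pod gets its eBPF maps attached on startup
kubectl delete pod <affected-pod> -n <namespace>
```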

I will try the v1.0.8-rc1 version and give you some feedback!

Many thanks

Got it. If v1.0.8-rc1 does not resolve the issue, you can send an email with the network policy agent logs to k8s-awscni-triage@amazon.com, and we can dig further. Before sending the logs, enable network policy event logs (https://github.com/aws/aws-network-policy-agent?tab=readme-ov-file#enable-policy-event-logs) so that policy decisions are logged as well.
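For the managed VPC CNI add-on, enabling the event logs should look roughly like this (the configuration key is taken from the README linked above; the cluster name is a placeholder):

```bash
# Enable policy event logs on the node agent via the VPC CNI managed add-on
aws eks update-addon \
  --cluster-name <cluster> \
  --addon-name vpc-cni \
  --configuration-values '{"nodeAgent": {"enablePolicyEventLogs": "true"}}'
```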

@alemuro - I reviewed your logs and pin paths are getting cleaned up. #179 will likely fix your issue. Do let us know once you are on v1.0.8-rc1. Thanks!!

Hello, we've been testing this for the whole day and it seems to be fixed.