Pods with no eBPF maps attached
alemuro opened this issue · comments
What happened:
Sometimes, when starting new pods, they are not reachable by other pods. After some debugging I realised that:
- There is no eBPF map attached to these pods when executing `/opt/cni/bin/aws-eks-na-cli ebpf loaded-ebpfdata | grep Pod`, but they do have a mapping when everything works fine.
- There is a log entry with the text "Target Pod doesn't belong to the current pod Identifier". The command `grep "Target Pod doesn't belong to the current pod Identifier:" network-policy-agent.log | sed -e "s/.*Pod ID\: //" | awk -F "\"" '{print $3}' | sort -n | uniq` returns the list of all pods that are hosted on the current instance and are not reachable from other pods (because they don't have a map).
- If I open a shell in those affected pods, I can see that they can connect to the internet and have access to all IPs, even to an IP that should be filtered by a network policy attached to the namespace.
Our network policies are composed of:
- A generic NetworkPolicy that affects the whole namespace, which has the following rules:
  - deny all ingress traffic by default
  - allow all egress traffic going to the internet EXCEPT for a specific IP. <-- This is not filtered on the affected pods!
- A specific NetworkPolicy that is deployed with the application, which allows access from other services. <-- This is denied on the affected pods!
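For illustration, a minimal sketch of what the generic namespace-wide policy described above could look like. This is not the reporter's actual manifest; the policy name, namespace, and filtered CIDR are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress-limit-egress   # hypothetical name
  namespace: my-namespace                   # hypothetical namespace
spec:
  podSelector: {}          # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed -> all ingress denied by default
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0        # allow all egress to the internet...
            except:
              - 203.0.113.10/32    # ...EXCEPT this one IP (placeholder)
```

On the affected pods, the `except` block above is the part that is not being enforced.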
If we take a look at the `PolicyEndpoint` resources, they look fine. It seems like a problem between the controller and eBPF.
What you expected to happen:
- Ingress traffic should be allowed from the specified pods.
- Egress traffic should be filtered to the IPs listed in the `except` parameter.
- There should be an eBPF program attached to all pods that have Network Policies attached.
How to reproduce it (as minimally and precisely as possible):
It is random in our setup; we haven't figured out yet how to reproduce it.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): v1.27
- CNI Version: v1.16.0-eksbuild.1
- Network Policy Agent Version: v1.0.7
- OS (e.g. `cat /etc/os-release`): Amazon Linux v2
- Kernel (e.g. `uname -a`): 5.10.201-191.748.amzn2.x86_64
@alemuro is the problem persistent, i.e. the eBPF program never gets attached? We do have one known issue that was just fixed by #179. The short story is that if there are multiple replicas of the same pod on a node, there is a race condition where when one replica is deleted, the eBPF program for the other replica can also be deleted.
If this is a staging environment, you can try the v1.0.8-rc1 release candidate image that we just built. The official v1.0.8 image will be released in the coming weeks.
@alemuro is the problem persistent, i.e. the eBPF program never gets attached?
It is never attached. The only way of fixing it is by deleting the pod and letting Kubernetes create a new one.
Will try the v1.0.8-rc1 version, and I will give you some feedback!
Many thanks
Got it. If v1.0.8-rc1 does not resolve the issue, you can send an email with the network policy agent logs to k8s-awscni-triage@amazon.com, and we can dig further. Before sending the logs, enable network policy event logs (https://github.com/aws/aws-network-policy-agent?tab=readme-ov-file#enable-policy-event-logs) so the policy decisions can be logged as well.
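For reference, the event logs mentioned above can be enabled by passing configuration values to the managed VPC CNI addon, e.g. via `aws eks update-addon --addon-name vpc-cni --configuration-values`. A sketch of the values, with the key names taken from the linked README (verify them against your addon's configuration schema, as they are an assumption here):

```json
{
  "nodeAgent": {
    "enablePolicyEventLogs": "true"
  }
}
```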
Hello, we've been testing this for the whole day and it seems to be fixed.
v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3