aws/aws-network-policy-agent


unexpected DENY entries

ed4wg opened this issue

What happened:

In reviewing the logs, I see some unexpected DENY entries in the wrong direction.

In this example, pod A is connecting to pod B on port 5003; pod A only talks to pod B, not the other way around. I believe this particular traffic is coming from a liveness probe that runs every 5 minutes: the probe on pod A runs a command that makes a simple GET request to pod B from code. In the example below, you can see the first connection works, but then 5 minutes later there are a bunch of DENY entries (from pod B -> pod A) followed by an ACCEPT in the correct direction. This pattern repeats over and over in the logs at the same interval.
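
For illustration only, the probe is roughly of the shape sketched below, written here with the Kubernetes Go types; the command, URL, and thresholds are invented placeholders rather than the real manifest, and only the 5-minute period and port 5003 match what the logs show.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical exec liveness probe on pod A that issues a GET to pod B.
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				// Invented helper command; the real probe just makes a simple GET from code.
				Command: []string{"/app/healthcheck", "--url", "http://pod-b:5003/healthz"},
			},
		},
		PeriodSeconds:    300, // the 5-minute interval seen in the logs
		TimeoutSeconds:   10,  // placeholder
		FailureThreshold: 3,   // placeholder
	}

	out, _ := yaml.Marshal(probe)
	fmt.Println(string(out))
}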

I'm also seeing other log entries with the same issue: a DENY where the target pod shows up as the SRC instead of the DEST.

I also don't see any errors in the application logs themselves, i.e. I'm not seeing any actual problems from the apps running on pod A and pod B.

Example

2023-12-06T14:25:58.455-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:25:58.466-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:30:28.869-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:28.879-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:28.892-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:29.037-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:29.458-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:30.303-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:30.808-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:31.953-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:35.294-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:41.937-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:55.287-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:58.838-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:30:58.850-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT

What you expected to happen:
Assuming the correct NetworkPolicy is in place to allow traffic from pod A -> pod B, I do not expect to see any DENY logs showing pod B -> pod A when the connection is initiated from pod A -> pod B.
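
To make "the correct NetworkPolicy" concrete, here is a hypothetical policy of the kind assumed above, built with the Kubernetes Go types purely for illustration (names and labels are invented; only the port matches the logs): allow ingress to pod B on TCP/5003 from pod A. NetworkPolicy is enforced per connection, so reply traffic from pod B back to pod A on an allowed connection should not need a rule of its own.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	tcp := corev1.ProtocolTCP
	port := intstr.FromInt(5003)

	// Placeholder policy: allow ingress to pod B on TCP/5003 from pod A only.
	policy := networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-pod-a-to-pod-b"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"app": "pod-b"}},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "pod-a"}},
				}},
				Ports: []networkingv1.NetworkPolicyPort{{Protocol: &tcp, Port: &port}},
			}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
		},
	}

	out, _ := yaml.Marshal(policy)
	fmt.Println(string(out))
}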

How to reproduce it (as minimally and precisely as possible):
I have not been able to trigger this directly. I just see it in the logs.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): GitVersion:"v1.28.3-eks-4f4795d"
  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version: v1.0.6-eksbuild.1
  • OS (e.g: cat /etc/os-release): amazon:amazon_linux:2
  • Kernel (e.g. uname -a): Linux ip-172-27-44-11.ec2.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Thanks @ed4wg. Would you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.

I started using it about 45 minutes ago and I'm not seeing any of those unexpected DENY logs. :)

I'll report back tomorrow after it has chance to run for longer.

Everything's looking good. Not a single errant DENY has shown up in the logs.

Do you have an ETA for the release?

@ed4wg Thanks for validating the rc image. Official release should be available within the next 2 weeks.

In further testing, I noticed this problem is back. Here are the differences:

  • I was on a single-node dev cluster before, and there wasn't much workload.
  • Now I'm on a 3-node cluster with the autoscaler. The unexpected DENY logs really picked back up when I put the system under load.

I saw that 1.0.7-rc4 was available and that also has the same issue.

So is the behavior the same as before, i.e. the first connection works but then 5 minutes later there are a bunch of DENY statements?

Can you share the node logs? You can run sudo bash /opt/cni/bin/aws-cni-support.sh on the node and mail the resulting logs to k8s-awscni-triage@amazon.com. A sample pod spec would also help us understand the flow.

It seems there are multiple patterns I'm observing now. I put more context in the email I sent.

@ed4wg Can you check and let us know the kernel's conntrack table size on your nodes?
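
In case it helps, the current usage can be read from the standard kernel counters, e.g. /proc/sys/net/netfilter/nf_conntrack_count versus nf_conntrack_max (assuming the nf_conntrack module is loaded). A minimal sketch that reads them on the node:

package main

import (
	"fmt"
	"os"
	"strings"
)

// readProc returns the trimmed contents of a /proc file, or the error text.
func readProc(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return "unavailable: " + err.Error()
	}
	return strings.TrimSpace(string(b))
}

func main() {
	fmt.Println("nf_conntrack_count:", readProc("/proc/sys/net/netfilter/nf_conntrack_count"))
	fmt.Println("nf_conntrack_max:  ", readProc("/proc/sys/net/netfilter/nf_conntrack_max"))
}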

Hello!

I'm facing the same issue; do you have any news on this topic? We managed to "fix" it by removing the default rule and deploying it again...

Thanks

Regarding @ed4wg's issue after scaling: I found a race condition where, when we try detaching a probe, the netlink library returns a "link not found" error even though the ENI is present. The node agent (NA) assumes the link is deleted and cleans up the pin path while the ENI is still active with a filter attached. Future updates based on the NA cache assume the probe is still active, but traffic will fail because the filter points to a program that has already been deleted, causing inconsistent behavior. Will provide a fix soon.
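
To make the failure mode concrete, here is a minimal sketch (not the actual agent code; the function, interface, and pin path names are invented) of the kind of guard this needs: on a "link not found" error from netlink, re-confirm the link is really gone before removing the pinned program that the TC filter still references.

package main

import (
	"errors"
	"fmt"
	"os"

	"github.com/vishvananda/netlink"
)

// cleanupProbePin removes a pinned eBPF program for an interface only after
// confirming the link is actually gone, instead of trusting a single
// "link not found" error.
func cleanupProbePin(ifName, pinPath string) error {
	_, err := netlink.LinkByName(ifName)
	if err == nil {
		// The ENI's interface is still present; keep the pin so the attached
		// TC filter keeps referencing a live program.
		return fmt.Errorf("link %s still exists, not removing pin %s", ifName, pinPath)
	}

	var notFound netlink.LinkNotFoundError
	if !errors.As(err, &notFound) {
		// Transient netlink failure: do not treat it as "link deleted".
		return fmt.Errorf("could not confirm state of %s: %w", ifName, err)
	}

	// Link is confirmed gone; removing the pin (and updating the cache) is now safe.
	if err := os.Remove(pinPath); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}

func main() {
	if err := cleanupProbePin("eth1", "/sys/fs/bpf/example-probe-pin"); err != nil {
		fmt.Println("cleanup skipped:", err)
	}
}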

We have the v1.0.8-rc1 tag available if you would like to try it.

I ran a test on my dev cluster using v1.0.8-rc2 over the weekend and I'm still seeing unexpected DENY logs. However, before this update I was seeing way more of them; it's quite a bit less now, but still there. Looking at some of the DENY logs, it looks to be the same as before: problems resolving DNS, problems egressing to external services like S3, and problems with one pod connecting to a service on another pod.

Thanks for trying it out. We were able to repro the issue and have a possible root cause. We're still in the process of completing the tests and code review, and will cut a release candidate once that's done.

Hi @jayanthvn - any update on a fix for this issue?

The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3. Please try it out and let us know if you see any issues.

I'm still seeing some unexpected DENY entries. But I'm also just pointing to the v1.0.8 aws-eks-nodeagent image; the overall v1.16.3 release isn't available via the add-on yet, so I'm still on v1.16.2. Not sure if that makes a difference here?

@ed4wg - Are you available on K8s Slack? If so, can you share your handle? We can get on a call and look into this further.

Hi @jayanthvn - I sent an email to k8s-awscni-triage@amazon.com with the details.

Just for closure here: we synced up offline, and the DENY logs are expected since they occur before the network policy is reconciled (i.e. before probes are attached) or while the pod/service is getting terminated. No functional impact was seen.