aws/aws-network-policy-agent


unexpected DENY entries

ed4wg opened this issue

What happened:

In reviewing the logs, I see some unexpected DENY entries in the wrong direction.

In this example, pod A is connecting to pod B on port 5003; pod A only talks to pod B, not the other way around. I believe this particular traffic is coming from a liveness probe that runs every 5 minutes: the probe on pod A runs a command that makes a simple GET request to pod B from code. In the example below, you can see the first connection works, but then 5 minutes later there are a bunch of DENY entries (from pod B -> pod A) followed by an ACCEPT in the correct direction. This pattern repeats over and over in the logs at the same interval.
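
For illustration only, the probe is roughly of the shape sketched below, written here with the Kubernetes Go types; the command, URL, and thresholds are invented placeholders rather than the real manifest, and only the 5-minute period and port 5003 match what the logs show.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical exec liveness probe on pod A that issues a GET to pod B.
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				// Invented helper command; the real probe just makes a simple GET from code.
				Command: []string{"/app/healthcheck", "--url", "http://pod-b:5003/healthz"},
			},
		},
		PeriodSeconds:    300, // the 5-minute interval seen in the logs
		TimeoutSeconds:   10,  // placeholder
		FailureThreshold: 3,   // placeholder
	}

	out, _ := yaml.Marshal(probe)
	fmt.Println(string(out))
}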

I'm also seeing other log entries with the same issue: a DENY where the target pod shows up as the SRC instead of the DEST.

I also don't see any errors in the application logs themselves, i.e. I'm not seeing any actual problems from the apps running on pod A and pod B.

Example

2023-12-06T14:25:58.455-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:25:58.466-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:30:28.869-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:28.879-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:28.892-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:29.037-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:29.458-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:30.303-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:30.808-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:31.953-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:35.294-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:41.937-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:55.287-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.38.227;SPORT: 5003;DIP: 172.27.32.157;DPORT: 60476;PROTOCOL: TCP;PolicyVerdict: DENY
2023-12-06T14:30:58.838-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT
2023-12-06T14:30:58.850-06:00   Node: ip-172-27-44-11.ec2.internal;SIP: 172.27.32.157;SPORT: 60476;DIP: 172.27.38.227;DPORT: 5003;PROTOCOL: TCP;PolicyVerdict: ACCEPT

What you expected to happen:
Assuming the correct NetworkPolicy is in place to allow traffic from pod A -> pod B, I do not expect to see any DENY logs showing pod B -> pod A when the connection is initiated from pod A -> pod B.
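
To make "the correct NetworkPolicy" concrete, here is a hypothetical policy of the kind assumed above, built with the Kubernetes Go types purely for illustration (names and labels are invented; only the port matches the logs): allow ingress to pod B on TCP/5003 from pod A. NetworkPolicy is enforced per connection, so reply traffic from pod B back to pod A on an allowed connection should not need a rule of its own.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	tcp := corev1.ProtocolTCP
	port := intstr.FromInt(5003)

	// Placeholder policy: allow ingress to pod B on TCP/5003 from pod A only.
	policy := networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-pod-a-to-pod-b"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"app": "pod-b"}},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "pod-a"}},
				}},
				Ports: []networkingv1.NetworkPolicyPort{{Protocol: &tcp, Port: &port}},
			}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
		},
	}

	out, _ := yaml.Marshal(policy)
	fmt.Println(string(out))
}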

How to reproduce it (as minimally and precisely as possible):
I have not been able to trigger this directly. I just see it in the logs.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): GitVersion:"v1.28.3-eks-4f4795d"
  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version: v1.0.6-eksbuild.1
  • OS (e.g: cat /etc/os-release): amazon:amazon_linux:2
  • Kernel (e.g. uname -a): Linux ip-172-27-44-11.ec2.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Thanks @ed4wg. Would you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.

I started using it about 45 minutes ago and I'm not seeing any of those unexpected DENY logs. :)

I'll report back tomorrow after it has chance to run for longer.

Everything's looking good. Not a single errant DENY has shown up in the logs.

Do you have an ETA for the release?

@ed4wg Thanks for validating the rc image. Official release should be available within the next 2 weeks.

In further testing, I noticed this problem is back. Here are the differences:

  • I was on a single-node dev cluster before, and there wasn't much workload.
  • Now I'm on a 3-node cluster with the autoscaler. The unexpected DENY logs really picked back up when I put the system under load.

I saw that 1.0.7-rc4 was available and that also has the same issue.

So is the behavior the same as before, i.e. the first connection works but then 5 minutes later there are a bunch of DENY statements?

Can you share the node logs? You can run sudo bash /opt/cni/bin/aws-cni-support.sh on the node and mail the resulting logs to k8s-awscni-triage@amazon.com. A sample pod spec would also help us understand the flow.

It seems there are multiple patterns I'm observing now. I put more context in the email I sent.

@ed4wg Can you check and let us know the kernel's conntrack table size on your nodes?
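
In case it helps, the current usage can be read from the standard kernel counters, e.g. /proc/sys/net/netfilter/nf_conntrack_count versus nf_conntrack_max (assuming the nf_conntrack module is loaded). A minimal sketch that reads them on the node:

package main

import (
	"fmt"
	"os"
	"strings"
)

// readProc returns the trimmed contents of a /proc file, or the error text.
func readProc(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return "unavailable: " + err.Error()
	}
	return strings.TrimSpace(string(b))
}

func main() {
	fmt.Println("nf_conntrack_count:", readProc("/proc/sys/net/netfilter/nf_conntrack_count"))
	fmt.Println("nf_conntrack_max:  ", readProc("/proc/sys/net/netfilter/nf_conntrack_max"))
}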

Hello!

I'm facing the same issue; do you have any news on this topic? We managed to "fix" it by removing the default rule and deploying it again...

Thanks

Regarding @ed4wg's issue after scaling: I found a race condition where, when we try detaching a probe, the netlink library returns a "link not found" error even though the ENI is present. The node agent (NA) assumes the link is deleted and cleans up the pin path while the ENI is still active with a filter attached. Future updates based on the NA cache assume the probe is still active, but traffic will fail because the filter points to a program that has already been deleted, causing inconsistent behavior. Will provide a fix soon.
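
To make the failure mode concrete, here is a minimal sketch (not the actual agent code; the function, interface, and pin path names are invented) of the kind of guard this needs: on a "link not found" error from netlink, re-confirm the link is really gone before removing the pinned program that the TC filter still references.

package main

import (
	"errors"
	"fmt"
	"os"

	"github.com/vishvananda/netlink"
)

// cleanupProbePin removes a pinned eBPF program for an interface only after
// confirming the link is actually gone, instead of trusting a single
// "link not found" error.
func cleanupProbePin(ifName, pinPath string) error {
	_, err := netlink.LinkByName(ifName)
	if err == nil {
		// The ENI's interface is still present; keep the pin so the attached
		// TC filter keeps referencing a live program.
		return fmt.Errorf("link %s still exists, not removing pin %s", ifName, pinPath)
	}

	var notFound netlink.LinkNotFoundError
	if !errors.As(err, &notFound) {
		// Transient netlink failure: do not treat it as "link deleted".
		return fmt.Errorf("could not confirm state of %s: %w", ifName, err)
	}

	// Link is confirmed gone; removing the pin (and updating the cache) is now safe.
	if err := os.Remove(pinPath); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}

func main() {
	if err := cleanupProbePin("eth1", "/sys/fs/bpf/example-probe-pin"); err != nil {
		fmt.Println("cleanup skipped:", err)
	}
}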

We have the v1.0.8-rc1 tag available if you would like to try it.

I ran a test on my dev cluster using v1.0.8-rc2 over the weekend and I'm still seeing unexpected DENY logs. However, before this update I was seeing way more of them; it's quite a bit less now, but still there. Looking at some of the DENY logs, it looks to be the same as before: problems resolving DNS, problems egressing to external services like S3, and problems with one pod connecting to a service on another pod.

Thanks for trying it out. We were able to repro the issue and have a possible root cause. We're still in the process of completing the tests and code review, and will cut a release candidate once that's done.

Hi @jayanthvn - any update on a fix for this issue?

The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3. Please try it out and let us know if you see any issues.

I'm still seeing some unexpected DENY entries. But I'm also just pointing to the v1.0.8 aws-eks-nodeagent image; the overall v1.16.3 release isn't available via the add-on yet, so I'm still on v1.16.2. Not sure if that makes a difference here?

@ed4wg - Are you available on K8s Slack? If so, can you share your handle? We can get on a call and look into this further.

Hi @jayanthvn - I sent an email to k8s-awscni-triage@amazon.com with the details.

Just for closure here: we synced up offline, and the DENY logs are expected since they occur before the network policy is reconciled (i.e. before probes are attached) or while the pod/service is getting terminated. No functional impact was seen.