aws / aws-network-policy-agent


Random Client Read Timeouts to RDS After Enabling Network Policies via AWS CNI

ndrafahl opened this issue · comments

commented

What happened:
We had the AWS CNI running in the cluster without the Network Policies being enforced by the CNI:
{ "enableNetworkPolicy": "false" }

We enabled the network policies to be enforced during a maintenance window (only change made was { "enableNetworkPolicy": "true" }), and it appeared at the time like everything was running correctly. The aws-node pods rotated out OK on the worker nodes, and the CNI addon update was marked as successful.
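For reference, a minimal sketch of how that addon configuration change can be applied with the AWS CLI (the cluster name below is a placeholder, and the exact workflow may differ depending on how the addon is managed in your account):

# enable network policy enforcement on the managed VPC CNI addon (placeholder cluster name)
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "true"}'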

We noticed, though, that after we made this change we're seeing intermittent client connection read timeouts between containers running in our cluster and RDS Postgres that were not occurring before the change:

2024-02-02 01:49:02 UTC:<container_ip_redacted>(52205):<db_user_redacted>@<db_redacted>:[17204]:LOG:  could not receive data from client: Connection timed out
2024-02-02 01:49:02 UTC:<container_id_redacted>(52205):<db_user_redacted>@<db_redacted>:[17204]:LOG:  unexpected EOF on client connection with an open transaction

The containers come up OK, connect to the RDS instance initially, and start processing work, but at some point we see the error message above in the RDS instance logs, and the query we were running from our container basically gets stuck.

Bouncing the affected pod(s) appears to resolve the issue temporarily. The pods are typically moved to one of our other nodes, reconnect, and begin working again for roughly 5-30 minutes.

None of the network policies being enforced by the CNI define egress rules - we are only enforcing a small handful of ingress rules between services that exist within the cluster itself.
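For context, the enforced policies are shaped roughly like the following minimal sketch (namespace, names, labels, and port are all hypothetical; only ingress rules are specified, so egress stays unrestricted):

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # hypothetical name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend                  # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend         # hypothetical label
      ports:
        - protocol: TCP
          port: 8080
EOF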

Thank you!

Environment:

  • Kubernetes version (use kubectl version): v1.25.16-eks-8cb36c9
  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version: v1.0.6-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.10.199-190.747.amzn2.x86_64
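If it helps with triage, a hedged sketch of commands that report the versions above (the cluster name is a placeholder):

# managed addon version
aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni \
  --query 'addon.addonVersion' --output text

# container images in the aws-node daemonset, which include the network policy agent image tag
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[*].image}'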
commented

This also appears to be occurring for REST calls to outbound services.

We are pretty consistently seeing request timeouts (root cause: Net::ReadTimeout: Net::ReadTimeout from the Ruby code) when making REST calls to both internal services/components within the cluster and external resources.

We're trying to narrow down whether the additional node agent that is now running on the host nodes is somehow "interfering" with network connections, both internally in the cluster and to external resources outside of it. It seems that the initial connection is successful, but the connection is terminated (possibly an incorrect assumption) while it is open, so the server attempts to respond to the client but cannot.

I should add that we also updated kube-proxy to the latest supported 1.25 version before we began enforcing the network policies via the CNI.

@ndrafahl it is possible this issue was fixed by #185, which is fixed in v1.0.8, targeted for release in mid-Feb.

The TL;DR for that issue is that when multiple pod replicas were on the same node, deleting one of the replicas was inadvertently clearing the BPF pinpath for the other replica.

commented

Hey @jdn5126!

We're actually seeing this issue without any pods being deleted within the cluster.

Hey @jdn5126!

We're actually seeing this issue without any pods being deleted within the cluster.

I see, let's give the Network Policy maintainers a chance to comment here and see if they have any theories

commented

We're kind of following this thread over here that someone else brought up, but the difference for us is that we're seeing the issue in pods that have been running for some time: #186

@ndrafahl - You are on agent version v1.0.6, and certain fixes went into v1.0.7 that fixed conntrack cleanup for long-running connections; the issue you are mentioning looks similar. We are in the process of building v1.0.8 and should have the release by next week. If you would like, you can try the v1.0.8-rc3 release candidate image.

commented

Hey @jayanthvn - is there anything we could provide from our end that would confirm the "smoking gun" is that we just need to upgrade to v1.0.7 (via addon version 1.16.2-eksbuild.1)?

We're a bit hesitant to touch anything in this environment at this point, only because we're "kind of working" and don't want to push ourselves to "not working at all".

Is there any risk in updating to 1.16.2-eksbuild.1?

I should add we can absolutely do the same change in our lower environments, but our lower environments, on the same version of the addon and node agent, do not appear to have this issue.

commented

Quick follow-up question: would those conntrack cleanups relate to connections going out of the cluster (i.e. to RDS), connections internal to the cluster (i.e. between pods/services), and connections between containers in the same pod?

Yes, all initial connections going through the probes attached to the pod's veth will be logged in the local conntrack table. After the initial connection, the remaining packets should just match the local conntrack entry and egress out of the ENI. The local conntrack entry will expire once the kernel conntrack entry expires.
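To correlate from the node side, a hedged sketch for listing the kernel conntrack entries for a given flow (requires conntrack-tools on the node; the database IP and port are placeholders). This only shows the kernel table, not the agent's own eBPF conntrack map:

# kernel conntrack entries toward the database endpoint (placeholder IP/port)
sudo conntrack -L -p tcp --dst 10.0.0.50 --dport 5432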

Sure, can you share the node logs? You can run sudo bash /opt/cni/bin/aws-cni-support.sh on the node that is initiating the connection and we can take a look. You can mail them to k8s-awscni-triage@amazon.com

commented

Sure thing - do you want it from the host node where the pod was running, that ran into the issue with the RDS timeout? We are seeing issues internally as well, but I figured that could be a good starting point.

Yes, sure, that would help.

commented

Sounds good, I sent it over with the subject "Issue aws/amazon-vpc-cni-k8s#204". Thanks for the help.

Thanks @ndrafahl, I sent the logs you shared via support case internally to the dev team. They confirmed that the incorrect conntrack cleanups identified in vpc-cni-np-agent v1.0.6 affect any initial connection going through the probes (either ingress or egress) attached to the veth; it doesn't differentiate between pod-pod, pod-service, or pod-external. So the upgrade you did to v1.0.7 of the vpc-cni-np-agent should fix the issue.

If you would like, you can try the v1.0.8-rc3 release candidate image on your non-prod cluster by updating the image to

602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-rc3

v1.0.8-rc3 has a few optimizations and enhancements. The team is working on the v1.0.8 release, which should be out by next week.
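For anyone who wants to try the release candidate, a hedged sketch of how the node agent image can be swapped on the aws-node daemonset (the node agent container is named aws-eks-nodeagent in recent addon versions - verify the container name in your cluster, and note that a later managed-addon update may overwrite a manual image change):

kubectl set image daemonset/aws-node -n kube-system \
  aws-eks-nodeagent=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-rc3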

commented

Thanks @jayanthvn and @maiconrocha -

It does appear that upgrading to v1.0.7 solved our issues with the random read timeouts we were seeing; we are considerably more stable today.

We'll be watching for the GA of the v1.0.8 release, and will go with that when it's ready.

commented

Hey -

I hate to reopen this issue. We've seen considerably fewer issues with timeouts after upgrading to the 1.0.7 version of the network policy agent, but we're still seeing sporadic timeouts on that version.

I know one option is to test out the 1.0.8-rc3 image specifically, but we're a bit hesitant to try out a release candidate image in our Production environment, and it's been difficult to replicate the same issue in a non-production environment that does not have the same overall usage as production.

It sounds like 1.0.8 should have additional fixes for the conntrack cleanup issues we were hitting in 1.0.6, which were mostly fixed in 1.0.7. Is that the case, and if so:

  • Is it possible to confirm, before we make the change, that our remaining issues would be solved by going to 1.0.8 when it releases? Is there anything in the logs we could provide to show that's the case?
  • If so, I know it was targeted to release this week. Is that still the planned release window?

Effectively, our other option is to stop using the AWS CNI to enforce network policies and hopefully go back to a position where we were not seeing any of the network timeouts. I know @jayanthvn you provided the following instructions on how to revert the change by:

  • Deleting all Network Policies in the Cluster
  • Disabling the controller by updating the amazon-vpc-cni configmap
  • Setting enableNetworkPolicy to false in the managed addon additional configuration options.

The question I have regarding this: in my testing in another environment, it appears this leaves the node agent still running in the aws-node daemonset. Would that still actually stop the netpol agent from doing anything with conntrack entries on the host machine, assuming that's still the culprit?

Sorry for the long comment, I appreciate you guys having helped us out thus far.

@ndrafahl - Yes, we were able to repro the timeouts in v1.0.7 and it is fixed in v1.0.8. The CNI release is in progress and we should have it by the end of this week or early next week. Would highly recommend trying it.

Regarding the instructions to disable: yes, those mentioned above are correct, and whether the network policy agent is enabled or disabled won't impact any conntrack entries in the kernel.

commented

Hey @jayanthvn -

Sorry for the thousand questions I've had; I thought of a few more over the weekend as we wait for the 1.0.8 release.

  1. Were your repros in clusters that have no pod churn? I know there's a sporadic issue I've seen people report (I believe in 1.0.7) where, if pods from the same deployment exist on the same node and one is removed, it can cause issues with the pod that's left (probably paraphrasing a bit here). We have no pod churn in our cluster, so I was just curious.
  2. Is there anything we would need to do after the 1.0.8 upgrade, i.e. pod restarts or node replacements?
  3. If we had to roll back, is it largely safe to roll back to the previous version of the node agent?

Thanks a ton.

@ndrafahl - Happy to answer. Please feel free to reach out....

Yes, we tried with and without pod churn. Yes, please redeploy the replicas; node replacement is not needed. Please keep us updated. The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
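As a hedged sketch of the "redeploy the replicas" step, assuming the workloads are Deployments (the deployment name and namespace are placeholders):

# restart one deployment and wait for it to finish rolling
kubectl rollout restart deployment/my-app -n my-namespace
kubectl rollout status deployment/my-app -n my-namespace

# on recent kubectl versions, omitting the name restarts every deployment in the namespace
kubectl rollout restart deployment -n my-namespace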

commented

Does the release of the managed VPC CNI EKS Addon typically follow suit (which would then have 1.0.8 bundled in), or will it be a while before that release?

Does the release of the managed VPC CNI EKS Addon typically follow suit (which would then have 1.0.8 bundled in), or will it be a while before that release?

Yep, the VPC CNI EKS Addon release is in the pipeline. This version should be available in all regions within the next 48 hours

@ndrafahl are you still experiencing any of these issues after upgrading to Network Policy agent v1.0.8?

commented

@ndrafahl are you still experiencing any of these issues after upgrading to Network Policy agent v1.0.8?

It's funny you ask - I was literally just talking to a coworker about the change request we are planning in order to do this in Production.

We have not yet deployed it to Production. We had the same version of the managed addon, and the node agent, installed in our two lower environment EKS clusters and could not replicate the issue we are seeing in Production (very convenient).

The addon showed a new version available late Friday, so we went ahead and upgraded it this morning in our two lower env clusters, and we're doing some testing to see if anything is broken that wasn't broken before.

We then have to decide the date to deploy this to Production, and go through restarting all of our deployments in the cluster after the change. We're trying to come up with a backout plan in the event new issues arise or the problem isn't resolved. I think our true backout plan is probably going to be to try to disable network policy enforcement via the addon.

tl;dr (sorry, that was a lot to basically say...) we haven't yet deployed it in Production, but I'll be sure to follow up here when we do with the outcome.

No worries, thank you for the update, and let us know if we can help further!

Unfortunately, in my case, updating the vpc-cni add-on to v1.16.3-eksbuild.2 (the node agent is running version v1.0.8-eksbuild.1) did not help. I still see connection issues for a few pods.

E.g. the Keycloak pod cannot connect to the RDS DB. A pod restart sometimes helps and sometimes doesn't, but moving the pod to another node seems to help.

@albertschwarzkopf Are you saying the node where the pod lands has an effect on policy enforcement? Does the pod try to connect to the RDS DB endpoint right after boot-up, and if yes, are there any retries in place? Can you share the NP and node logs with us from the problematic node - k8s-awscni-triage@amazon.com

@achevuru today the pod started successfully after node downscaling. So let me watch it over the next few days and I will update this issue.

@achevuru I do not know why, but it has been running without issues for 2 days.

commented

Hey -

Just wanted to follow up on this, so the issue doesn't appear stale.

We are planning to apply this upgrade during a maintenance window we have coming up.

We did some additional investigating on v1.0.7 and found that some of the timeouts we were seeing do not appear to be directly correlated with the enforcement of network policies via the netpol agent. It looks instead like a call we're making out of the cluster is randomly taking longer than 60 seconds, so the nginx controller we're using internally in the cluster is "timing out" that connection after 60 seconds due to its configuration.
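As an aside on that 60-second limit: ingress-nginx's proxy read timeout defaults to 60s and can be raised per Ingress via an annotation if an upstream call legitimately needs longer (the ingress name, namespace, and value below are hypothetical):

kubectl annotate ingress my-app-ingress -n my-namespace \
  nginx.ingress.kubernetes.io/proxy-read-timeout="120" --overwrite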

Because of this, we delayed the upgrade to our regular maintenance window so customers are aware of the change.

I have a random question about the upgrade. You suggested restarting the replicas after the change. Originally we had planned to do this by rolling-restarting all of the deployments in the cluster, but we have to do an AMI upgrade anyway during our maintenance window. Do you see any issue if we instead do the following:

  1. Upgrade the CNI to 1.16.3
  2. Upgrade the AMIs to the latest, which will cause all of the nodes to be drained and replaced. Effectively causing all replicas to be restarted.

Basically, is there any cause for concern if pods don't get restarted immediately after the upgrade and instead get restarted as the nodes are replaced?

@ndrafahl that strategy sounds good to me, though please use VPC CNI v1.16.4 as v1.16.3 had a regression that could lead to high CPU consumption in some edge cases.

I do not see any cause for concern as long as the pods get restarted on any node.

commented

Cool, appreciate the reply and the heads up.

@ndrafahl - Please let us know if you are still seeing an issue with v1.0.8 release

commented

@ndrafahl - Please let us know if you are still seeing an issue with v1.0.8 release

Will do.

Sorry for the delayed update on this.

We did the upgrade in our lower environments last week with no new issues, but we also were not seeing the issues in those environments on 1.0.7.

We are doing Production this week.

commented

We upgraded to v1.16.4 last night - I'll keep this open for a bit longer and then follow up and close it out after we've confirmed the timeouts have disappeared.

I did see another new version was available last night when we went to do the upgrade, but we did not go to that as we hadn't yet tested that in our lower environments. Must've been released recently :)

commented

It appears we are still seeing sporadic timeouts between at least pods/containers within our cluster, after upgrading to v1.16.4.

We tracked down one that occurred today. Basically the traffic flow should have been:
pod1 [ORIGIN] -> REST call -> ingress-nginx-controller -> pod2 (nginx container) -> pod2 (app container)

pod2 then makes its own REST call (to complete the call back to pod1):
pod2 -> REST call -> ingress-nginx-controller -> pod1 (nginx container) -> pod1 (app container)

(We do proxying of calls between applications within our cluster for authentication reasons, which is why the back and forth occurs).

(We use an nginx "sidecar" container in most of our pods for TLS, which then forwards traffic on to another container in the same pod).

Unfortunately, the hop between the containers in pod1 is where the traffic never arrived. The "final" timeout was found in the nginx container's logs on pod1, corresponding to the original request:

2024/03/26 15:54:12 [error] 18#18: *24930 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /a/path HTTP/1.1", upstream: "http://127.0.0.1:8080/app/api/v2/a/path", host: "subdomain.domain.io"
NGINX_CONTROLLER_IP - - [26/Mar/2024:15:54:12 +0000] "POST /a/path HTTP/1.1" 504 167 "-" "Ruby"

The timeouts then cascaded back to the origin.

We checked the tomcat logs for the application container in pod1 and didn't see any log for that particular POST call, so it doesn't appear that the app container in pod1 saw the traffic coming from the nginx container in the same pod.
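Since the 504 fires at almost exactly 60 seconds, it lines up with nginx's default proxy_read_timeout of 60s in the sidecar. A hedged way to check what the sidecar is actually configured with (pod1 and the container name "nginx" are placeholders from the flow above; nginx -T dumps the effective config, and no output from the grep would mean the 60s default applies):

kubectl exec pod1 -c nginx -- nginx -T 2>/dev/null | grep -i proxy_read_timeout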

commented

I did run the log-collector script on the node where pod1 was running, to see if anything in the logs around our timestamp would tell us whether it's the CNI that is impacting the traffic between containers in the same pod. Not sure if that would be helpful to provide.

commented

We saw a small handful of other client timeout errors this morning. I traced down one that was effectively the same as the one in my previous comment.

Unfortunately for this one, the ingress-nginx-controller container logs had already rotated out and they weren't stored externally, so that's the one piece of the puzzle I don't have.

The traffic is again, as follows:
pod1 [ORIGIN] -> REST call -> ingress-nginx-controller -> pod2 (nginx container) -> pod2 (app container)

pod2 then makes its own REST call (to complete the call back to pod1):
pod2 -> REST call -> ingress-nginx-controller -> pod1 (nginx container) -> pod1 (app container)

This one was more interesting - pod1 did get the call (both tomcat and the application logged it), and the amount of time it took to complete was well below the 60 seconds that would trigger a normal nginx timeout.

I traced the call backwards from there, and found the timeout occurred this time between the nginx sidecar container in pod2 and the application container in pod2:

2024/03/27 08:00:00 [error] 25#25: *304040 upstream timed out (110: Operation timed out) while reading response header from upstream, client: INGRESS_NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /a/path HTTP/1.1", upstream: "http://127.0.0.1:8080/a/path", host: "subdomain.domain.io"

It seems like the connection between the nginx sidecar container and the application container in the same pod may have been dropped while it was still open. Nothing within the application container suggests it saw a client timeout connecting back to pod1.

I've also grabbed the log-collector tarball for the node where pod2 was running.

commented

Sorry for the spam, just trying to get information on here in the event someone has ideas on how to troubleshoot the issue.

We saw yet another issue this evening of another client timeout, where it appears that the connection between the containers within the same pod was dropped while the connection was actively open.

The traffic was the same as above, where there are basically two "connections" occurring: one from pod1 -> pod2, which is kept open while another, separate connection is made from pod2 -> pod1.

I grabbed another log collection from the node where the timeout occurred between the containers on pod2.

@ndrafahl I see that you've collected node logs. Can you share them with us @ k8s-awscni-triage@amazon.com? Do you see DENY entries in the network policy agent logs?

We saw yet another issue this evening of another client timeout, where it appears that the connection between the containers within the same pod was dropped while the connection was actively open.

Not following this. Containers in the same pod are talking to each other via localhost probably? Packets shouldn't even leave the pod network namespace in this case? Network policy shouldn't come into play for this use case at all.

commented

Hey @achevuru -

Not following this. Containers in the same pod are talking to each other via localhost probably? Packets shouldn't even leave the pod network namespace in this case? Network policy shouldn't come into play for this use case at all.

Correct - that appears to be the case anyway. Let me see if I can better explain the full chain of connections for our latest timeout, with pod names and all, as I think I've been explaining this poorly.

ingress-nginx-controller deployment (3 replicas) where 2 of them were involved:
ingress-nginx-controller-65c7b57b6d-5fmv8
ingress-nginx-controller-65c7b57b6d-n2sxh

core deployment (3 replicas) where 1 of them was involved:
core-5f76cc4c55-6gvrb

  • nginx container (to terminate SSL and forward to localhost in the same pod)
  • core container (application)

task deployment (5 replicas) where 2 of them were involved:
task-59fcb97fb6-zc7st

  • nginx container (to terminate SSL and forward to localhost in the same pod)
  • task container (application)

task-59fcb97fb6-nzcmj

  • same as above

There are basically two "connections" that occur as part of the chain. Connection 1 stays open until Connection 2 completes, and then everything rolls backwards.

The process is as follows:

CONNECTION 1

  1. task-59fcb97fb6-zc7st makes a REST call to https://subdomain.domain.com/app/components/...
  2. DNS record for this call needs to go to an ingress-nginx-controller pod for routing based on nginx ingresses. It makes it to ingress-nginx-controller-65c7b57b6d-5fmv8
2024/03/28 03:04:23 [warn] 438#438: *21137250 [lua] cookie.lua:146: bake(): SameSite value must be 'Strict', 'Lax' or 'None' while connecting to upstream, client: TASK_POD_IP, server: subdomain.domain.com, request: "POST /app/components/... HTTP/1.1", host: "subdomain.domain.com"
  3. The ingress-nginx-controller pod determines it needs to go to the core deployment, and the request makes it to core-5f76cc4c55-6gvrb. We don't see the call in the nginx sidecar logs until it times out, but here's the timeout:
2024/03/28 03:05:23 [error] 24#24: *534004 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /app/components/... HTTP/1.1", upstream: "http://127.0.0.1:8080/SUBDOMAIN/app/components/task/...", host: "subdomain.domain.io"

NGINX_CONTROLLER_IP - user [28/Mar/2024:03:05:23 +0000] "POST /app/components/task/... HTTP/1.1" 504 167 "-" "Ruby"
  4. The nginx sidecar container forwards the call on to the core container in the same pod.
  5. The core container then makes another REST call, which will take us on to another destination. Connection 1 remains open here, waiting for Connection 2 to complete.

CONNECTION 2

  1. core-5f76cc4c55-6gvrb makes a REST call to https://subdomain.domain.com/task/app/... to get information so it can reply back to the call on Connection 1.
  2. DNS record for this call needs to go to an ingress-nginx-controller pod for routing based on nginx ingresses. It makes it to ingress-nginx-controller-65c7b57b6d-n2sxh
2024/03/28 03:04:23 [warn] 307#307: *21155396 [lua] cookie.lua:146: bake(): SameSite value must be 'Strict', 'Lax' or 'None' while connecting to upstream, client: CORE_POD_IP, server: subdomain.domain.io, request: "POST /task/app/... HTTP/1.1", host: "subdomain.domain.io"
  3. The ingress-nginx-controller pod determines it needs to go to the task deployment, and the request makes it to task-59fcb97fb6-nzcmj. The connection shows a 200 in the nginx sidecar container logs, as it completes in less than a second:
INGRESS_NGINX_CONTROLLER_IP - - [28/Mar/2024:03:04:23 +0000] "POST /task/app/... HTTP/1.1" 200 147 "-" "Ruby"
  4. And in the ingress-nginx-controller pod, we see that Connection 2 was successful and responded back in less than a second:
CORE_POD_IP - - [28/Mar/2024:03:04:23 +0000] "POST /task/app/... HTTP/1.1" 200 142 "-" "Ruby" 1059 0.109 [task-deployment] [] TASK_POD_IP:443 147 0.109 200 18681ceec65d3ae2a1a27bed9b4fe3e0

CONNECTION 1

  1. At this point, the core pod has everything it needs to respond. There is no timeout seen from this application's perspective, which makes sense - Connection 2 never timed out.
  2. The application completes its work and "replies back" on Connection 1.
  3. Regardless, we see the timeout from the nginx container in the core pod for the connection:
2024/03/28 03:05:23 [error] 24#24: *534004 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: cluster.local, request: "POST /app/components/... HTTP/1.1", upstream: "http://127.0.0.1:8080/SUBDOMAIN/app/components/task/...", host: "subdomain.domain.io"

NGINX_CONTROLLER_IP - user [28/Mar/2024:03:05:23 +0000] "POST /app/components/task/... HTTP/1.1" 504 167 "-" "Ruby"
  4. We then see the timeout in the ingress-nginx-controller logs for Connection 1 as well, a minute later:
2024/03/28 03:05:23 [error] 438#438: *21137250 upstream timed out (110: Operation timed out) while reading response header from upstream, client: NGINX_CONTROLLER_IP, server: subdomain.domain.com, request: "POST /app/components/... HTTP/1.1", upstream: "https://CORE_POD_IP:443/app/components/...", host: "subdomain.domain.com"

TASK_POD_IP - user [28/Mar/2024:03:05:23 +0000] "POST /app/components/... HTTP/1.1" 504 104800 "-" "Ruby" 396 60.002 [upstream-default-backend] [] CORE_POD_IP:443 : INGRESS_DEFAULT_BACKEND_POD_IP:80 0 : 104800 60.000 : 0.002 504 : 504 18681ceec65d3ae2a1a27bed9b4fe3e0

I believe at this point, the connection between the nginx container and the core container on core-5f76cc4c55-6gvrb has been closed/dropped/etc. before Connection 2 completed. There's nothing to suggest that Connection 2 timed out, and it was less than a second in execution time, so there was plenty of time for it to respond.

Do you see DENY entries in the network policy agent logs?

To be honest, I'm not sure where to look in the log file for this. But I don't think there would be any DENY entries at all. Both connections were successful to their destinations, it's just that the traffic appears to have been dropped somewhere.
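For what it's worth, a hedged sketch of where to look on the node: the agent typically writes to /var/log/aws-routed-eni/network-policy-agent.log (verify the path on your AMI), and flow verdict (ACCEPT/DENY) entries may only appear there if policy event logging is enabled on the agent:

sudo grep -i "deny" /var/log/aws-routed-eni/network-policy-agent.log | tail -n 20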

Network policy shouldn't come into play for this use case at all

This makes sense, I think. But we didn't have client timeout issues until we enforced network policies via the CNI; prior to this, these issues weren't present in the cluster. Unfortunately, we haven't been able to replicate this issue in a lower environment (not sure whether that's due to the amount of traffic), so I haven't been able to prove that disabling policy enforcement would fix the problem either.

Can you share them with us @ k8s-awscni-triage@amazon.com?

Definitely - I sent the collected node logs for the node where the core pod was running, where we saw the timeout between the containers within the same pod. I sent it as the subject of "Issue aws/amazon-vpc-cni-k8s#204".

The original connection was at (UTC):
2024-03-28T03:04:23

The timeout was at (UTC):
2024-03-28T03:05:23

commented

Definitely - I sent the collected node logs for the node where the core pod was running, where we saw the timeout between the containers within the same pod. I sent it as the subject of "Issue aws/amazon-vpc-cni-k8s#204".

Ah, unfortunately I won't be able to send them via email - it looks like Amazon bounces the message back as it's slightly over the size limit. Not sure if you have another method for me to get you that file.

commented

This may be coincidental or unrelated, but looking at the network-policy-agent logs for the nodes where the timeout appears to have occurred between the nginx container and the application container in the same pod, I'm seeing conntrack cleanups a few milliseconds after the connection between the ingress-nginx-controller and the pod is created. These cleanups appear to occur every 5 minutes.

I am wondering if the nginx container is seeing the client timeout to the application container in the same pod only because the connection between the ingress-nginx-controller and the pod itself has been dropped?

Here's three examples (I've substituted out the IP addresses below).


Example 1:
Time of call between ingress-nginx-controller and pod IP: 2024-03-27T07:59:00,140
Source IP of ingress-nginx-controller: 192.1.1.1
Destination IP of pod where the timeout occurs between containers in pod: 192.2.1.1
Conntrack cleanup entries:

{"level":"info","ts":"2024-03-27T07:59:00.204Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Entry - ":"Expired/Delete Conntrack Key : Source IP - 192.1.1.1 Source port - 34176 Dest IP - 192.2.1.1 Dest port - 443 Protocol - 6 Owner IP - 192.2.1.1"}
< There are multiple cleanups here for the same source and destination IP >

Example 2:
Time of call between ingress-nginx-controller and pod IP: 2024-03-26T15:53:12,751
Source IP of ingress-nginx-controller: 192.3.1.1
Destination IP of pod where the timeout occurs between containers in pod: 192.4.1.1
Conntrack cleanup entries:

{"level":"info","ts":"2024-03-26T15:53:12.840Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Entry - ":"Expired/Delete Conntrack Key : Source IP - 192.3.1.1 Source port - 48212 Dest IP - 192.4.1.1 Dest port - 443 Protocol - 6 Owner IP - 192.4.1.1"}
< There are multiple cleanups here for the same source and destination IP >

Example 3:
Time of call between ingress-nginx-controller and pod IP: 2024-03-28T03:04:23,067
Source IP of ingress-nginx-controller: 192.1.1.1
Destination IP of pod where the timeout occurs between containers in pod: 192.5.1.1
Conntrack cleanup entries:

{"level":"info","ts":"2024-03-28T03:04:23.217Z","logger":"ebpf-client","caller":"wait/backoff.go:227","msg":"Conntrack cleanup","Entry - ":"Expired/Delete Conntrack Key : Source IP - 192.1.1.1 Source port - 41908 Dest IP - 192.5.1.1 Dest port - 443 Protocol - 6 Owner IP - 192.5.1.1"}
< There are multiple cleanups here for the same source and destination IP >
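For anyone trying to reproduce this correlation, a hedged sketch of how entries like the above can be pulled from the agent log (path as noted earlier; the timestamp prefix and IP are from Example 3 with the substituted addresses):

sudo grep "Conntrack cleanup" /var/log/aws-routed-eni/network-policy-agent.log \
  | grep "2024-03-28T03:04" | grep "192.5.1.1"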
commented

Hey -

Wanted to follow up and see if there were any thoughts on the previous comment, and on the fact that the conntrack cleanups always appear to occur almost immediately after we see our call being made (and always ~59 seconds before our client timeout error).

I wasn't sure if there was anything in those node logs that might better prove/disprove the theory.

Thanks!

@ndrafahl - Are you available on K8s Slack? Maybe over Slack you can attach the zip file for us? Please share your Slack ID and we can ping you there.

We're experiencing a similar network policy issue. Opened #245.

commented

Hey guys!

Any chance you might have a timeframe for a GA release of the version that'll have a fix for the race condition in the conntrack cleanups, or is a fix still being worked on?

We have the PRs merged and are working through release testing. We are planning to have the RC image this week, and the final build will be out before the end of May.

The fix is released with network policy agent v1.1.2 - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2. Please test and let us know if there are any issues.