aws / aws-network-policy-agent

Race condition causes quickly opened connections to fail

dave-powell-kensho opened this issue

What happened:

After enabling network policy support, we observed that applications which opened connections early in their lifecycle would hang. It appears the process had established a connection successfully and was then stuck in a read syscall indefinitely.

# ss -nitp
State                         Recv-Q                         Send-Q                                                 Local Address:Port                                                      Peer Address:Port                         Process                         
ESTAB                         0                              0                                                       xx.xx.xx.xx:44670                                                   xx.xx.xx.xx:443                           xx
         cubic wscale:13,7 rto:204 rtt:2.217/0.927 ato:40 mss:1388 pmtu:9001 rcvmss:1448 advmss:8949 cwnd:10 bytes_sent:783 bytes_acked:784 bytes_received:4735 segs_out:6 segs_in:7 data_segs_out:3 data_segs_in:4 send 50085701bps lastsnd:904904 lastrcv:904904 lastack:904900 pacing_rate 100160104bps delivery_rate 11131824bps delivered:4 app_limited busy:8ms rcv_space:56587 rcv_ssthresh:56587 minrtt:1.339 snd_wnd:65536

The process is stuck in a read syscall:

# cat /proc/8/syscall 
0 0x3 0x56505f69dcc3 0x5 0x0 0x0 0x0 0x7ffe65dd4f38 0x7f94ae42b07d
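
A rough sketch of how the blocked read can be tied back to the hung socket (PID 8 and fd 3 are the values from the capture above; these are standard procfs paths, so the same steps should work for any affected process):

cat /proc/8/syscall          # first field 0 == read(2) on x86_64; second field (0x3) is the file descriptor
ls -l /proc/8/fd/3           # resolves to socket:[<inode>] for the hung connection
ss -nitp | grep 'pid=8,'     # maps the pid/fd back to the ESTAB entry shown above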

This occurred across multiple disparate deployments, with the common feature being early outbound connections. When debugging the affected pods, we found that we were able to open new outbound connections without issue. Our theory is that the application opens connections early in the pod lifecycle, before the network policy agent has started enforcing, and once the agent applies its policies the already-established connection is affected. In these cases we had no egress-filtering network policies applied to the pods, but we did have ingress filters.
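
For concreteness, a minimal sketch of the policy shape described above, i.e. ingress filtering only with egress left unrestricted; the policy name, namespace, and selectors are placeholders, not the actual manifests in use:

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-from-peer   # hypothetical name
  namespace: example
spec:
  podSelector:
    matchLabels:
      app: example-client
  policyTypes:
    - Ingress                     # ingress filtering only; no egress rules
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: example-peer
EOF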

Attach logs

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
Create a pod that immediately attempts to download a file large enough for the transfer to last several seconds. The request hangs, but executing the same request on the same pod after some period of initialization succeeds.
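
A minimal sketch of such a repro (the image, URL, and pod name are placeholders; any target large enough that the transfer outlives the agent's initial reconciliation should do):

kubectl run netpol-repro --image=curlimages/curl --restart=Never --command -- \
  curl -sS -o /dev/null https://example.com/large-file
kubectl get pod netpol-repro -w      # stays Running while the first transfer hangs
kubectl exec netpol-repro -- curl -sS -o /dev/null https://example.com/large-file   # same request later in the pod's life succeeds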

Anything else we need to know?:
Possibly related to #144 ?

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: 1.15.3
  • Network Policy Agent Version: 1.0.5 and 1.0.8rc1
  • OS (e.g: cat /etc/os-release): AL2
  • Kernel (e.g. uname -a): 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
commented

Hey @dave-powell-kensho - did you end up finding a resolution to your problem?

We are seeing somewhat similar results in our environment after enabling network policy enforcement via the AWS CNI addon, results we were not seeing prior to enforcing the network policies. I don't think, though, that it's because they're occurring too quickly after the pod comes up.

We are seeing that connections appear to be successful initially, but are timing out due to a "Read Timeout" on the client's side. We know the initial connection is successful because there are actions that are being taken on the server side, and a retry of the same action basically gives us a response of "you already did this".

In other cases, we're seeing that connections that have no timeout enforced basically stay open indefinitely, and we have to forcefully reboot the pods (namely, connections to an RDS instance).

@ndrafahl We were not able to find a resolution that left the netpol agent running, and rolled back/unwound the netpol agent changes. We have not experienced these issues since the rollback ~2 weeks ago.

commented

Did you basically go through these steps, for your rollback?:

  • Deleted all of your ingress network policies in the cluster
  • Set enable-network-policy-controller to false in ConfigMap amazon-vpc-cni (kube-system NS).
  • Set the 'enableNetworkPolicy' parameter to false. This will disable the agents on the nodes.
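
For reference, a rough shell equivalent of those steps. The ConfigMap name and key are the ones listed above; the 'enableNetworkPolicy' key is shown here as an EKS addon configuration value, which should be verified against the vpc-cni addon's configuration schema for your version (helm-managed installs will differ):

kubectl get networkpolicy -A                      # review, then delete the ingress policies in question
kubectl -n kube-system patch configmap amazon-vpc-cni \
  --type merge -p '{"data":{"enable-network-policy-controller":"false"}}'
aws eks update-addon --cluster-name <cluster-name> --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "false"}'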

Out of curiosity, did you also try updating the addon to 1.16.x? That's been one suggestion that has been made to us, but we haven't yet taken that path. Right now we're trying to figure out which direction to take.

Sorry - to add one additional question, did you guys do the same thing in other environments without seeing the same issue there?

Yes, those are the steps we took, though we also removed the netpol agent container from the daemonset at the end.

I'm not aware of that version - I had seen similar issues where the developers requested trying 1.0.8rc1, which we did upgrade to (from 1.0.5), with the same results.

We were able to replicate this issue in multiple environments. We have left this addon enabled in our lowest environment so that we're able to test any potential fixes quickly.

cc @jayanthvn We've been sitting on this issue for a couple weeks now and would really appreciate some eyes from the maintainers.

commented

Did you find that you also needed to remove the node agent from the daemonset as well, after those steps, to get your issue to be resolved?

I tested the steps in a lower environment, and sure enough that container is still running on the pod even though the addon is set to not enforce network policies any longer.
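
For anyone checking the same thing, a quick way to list which containers the aws-node DaemonSet is running; in recent manifests the network policy agent container is named aws-eks-nodeagent, but verify against your version:

kubectl -n kube-system get ds aws-node \
  -o jsonpath='{.spec.template.spec.containers[*].name}{"\n"}'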

@dave-powell-kensho - Sorry, I somehow lost track of this. This is expected behavior if the connection is established prior to policy reconciliation against the new pod. Please see this - #189 (comment)

@ndrafahl We removed the node agent from the pod's container list, yes, though we self-manage the aws-node deployment config, so I cannot advise on helm charts and the like.

@jayanthvn Thank you for the update, we'll be looking forward to the release of the strict mode feature. Is there any issue or other location we can track to know when it is released?

commented

@dave-powell-kensho Thanks for the info, appreciate you responding. 👍

@dave-powell-kensho you can track the progress of #209 and its release

@jdn5126 @jayanthvn I'm not sure the Strict Mode solved the issue.

In fact, this issue specifically describes a bug in the standard mode (the default, as opposed to strict mode), which is (still) blocking some traffic.

Latest tests with v1.17.1-eksbuild.1 in standard mode: short-lived connections are still blocked, even though they are explicitly allowed by network policies, and after some time the pod is able to perform the same connection without any issue.
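
For anyone comparing modes, a hedged way to confirm which enforcement mode a node is actually configured for; NETWORK_POLICY_ENFORCING_MODE is the environment variable recent VPC CNI releases document for the standard/strict switch, but verify the name against your addon's documentation:

kubectl -n kube-system get ds aws-node -o yaml | grep -i -A1 NETWORK_POLICY_ENFORCING_MODE
# no output most likely means the variable is unset and the default (standard) mode applies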