amazonlinux / amazon-linux-2023

Amazon Linux 2023

Home Page: https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - Timeouts in execution in forked commands

wonko opened this issue

Describe the bug
We had issues running haproxy with a check-command configured in a container on EKS, using AL2023-EKS as our image for the worker nodes.

The result was a timeout on the check-command. We believe this is only one manifestation of the issue and suspect other processes are affected as well. It is unclear what exactly goes wrong, but it looks as if the process can't fork, or that once forked, the child can't run. We suspected ulimit, but the limits all seemed correct and sufficiently high. strace showed a high number of polling calls, but it was unclear why that would happen. dmesg didn't have any pointers.
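These are roughly the checks we ran inside the affected haproxy container; the pod layout and the use of pidof are illustrative, not a transcript of the exact session:

```sh
# Confirm resource limits inside the affected container look sane (they did for us)
ulimit -a

# Summarize syscalls of the haproxy process and everything it forks;
# this is where we saw the unusually high number of polling calls
strace -f -c -p "$(pidof haproxy)"

# Check the kernel ring buffer on the node for fork/OOM/seccomp hints (we found none)
dmesg | tail -n 50
```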

To Reproduce
Steps to reproduce the behavior:

  1. Deploy EKS 1.29 cluster
  2. Deploy worker nodes with AL2023-EKS-1.29-20240415
  3. Using helm, install the Percona operator and the MySQL/Galera database: https://docs.percona.com/percona-operator-for-mysql/pxc/helm.html (a sketch of the commands follows this list)
  4. Observe the behaviour inside the haproxy containers:
  • the proxy layer never becomes ready; it keeps failing
  • high CPU load
  • reports that the external check command timed out
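A minimal sketch of step 3, assuming the chart names from the linked Percona documentation; release names and namespace are arbitrary:

```sh
# Add the Percona chart repository and install the operator plus a PXC (MySQL/Galera) cluster
helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update

helm install my-op percona/pxc-operator --namespace pxc --create-namespace
helm install my-db percona/pxc-db --namespace pxc
```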

We tried changing the external check command to a plain /bin/true, but that didn't help, which ruled out the specific check script as the cause.
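One quick way to double-check this outside haproxy's own fork path is to time the command by hand inside the container; the pod and container names below are illustrative:

```sh
# Run the trivial check manually inside the haproxy container and time it,
# to see whether the slowness is tied to haproxy forking the check or to the container itself
kubectl exec -it cluster1-haproxy-0 -c haproxy -- /bin/sh -c 'time /bin/true'
```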

When we switch to Bottlerocket or AL2 and redeploy exactly the same workload, everything works. We verified by switching back, and the problems immediately returned. We tried various machine types (same issue) and various versions of the operator/database helm chart using different images (same result).

Expected behavior
A stable environment with all systems online and working, and no restarts, just as we see with Bottlerocket or AL2.

I am unsure if it is related, but I have experienced odd timeouts with EKS and AL2023 nodes as well. Under high load we see networking timeouts just trying to reach the kubernetes service from pods, as well as when fetching logs from pods. Reverting to AL2 resolved these issues completely for us.
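Something along these lines is enough to see the timeouts on our side; the pod names and image are illustrative:

```sh
# From inside the cluster: time a request to the in-cluster kubernetes service
# (a 401/403 response is fine here, we only care about connectivity and latency)
kubectl run net-test --image=curlimages/curl --restart=Never --rm -it -- \
  curl -sk -o /dev/null -w 'code=%{http_code} total=%{time_total}s\n' \
  https://kubernetes.default.svc/healthz

# Time fetching logs from a pod scheduled on an affected node (pod name is a placeholder)
time kubectl logs some-busy-pod --tail=100
```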

I brought this to AWS Support for EKS and they haven't been able to find anything of note.

Any updates here? Having to revert to AL2 isn't a viable long-term solution.

Can you specify the Node AMI version that displayed this behavior?

All I have is "AL2023-EKS-1.29-20240415".

@wonko - I launched a curl container inside a pod and had it communicate with the API server repeatedly (more than 100 times). We didn't see any connectivity issues on AL2023 with AMI 1.29.3-20240625.
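A rough sketch of that test, with the pod name, image, and loop count as assumptions rather than the exact setup:

```sh
# Start a long-lived curl pod, wait for it, then hit the API server from inside it repeatedly
kubectl run curl-test --image=curlimages/curl --restart=Never --command -- sleep 3600
kubectl wait --for=condition=Ready pod/curl-test

for i in $(seq 1 100); do
  kubectl exec curl-test -- curl -sk -o /dev/null \
    -w "attempt=$i code=%{http_code} time=%{time_total}s\n" \
    https://kubernetes.default.svc/healthz
done
```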

Is it perhaps the kind of client that is playing a part here?