amazonlinux / amazon-linux-2023

Amazon Linux 2023

Home Page: https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - Timeouts in execution in forked commands

wonko opened this issue

Describe the bug
We had issues running haproxy with a check-command configured in a container on EKS, using AL2023-EKS as our image for the worker nodes.

The result was a timeout on the check-command. We believe this is only one manifestation of the issue and suspect other processes are affected as well. It is unclear what exactly goes wrong, but it looks as if the process can't fork, or that once forked, the child can't run. We suspected ulimit, but the limits all seemed correct and sufficiently high. strace showed a high number of polling calls, but it was unclear why that would happen. dmesg didn't have any pointers.
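These are roughly the checks we ran inside the affected haproxy container; the pod layout and the use of pidof are illustrative, not a transcript of the exact session:

```sh
# Confirm resource limits inside the affected container look sane (they did for us)
ulimit -a

# Summarize syscalls of the haproxy process and everything it forks;
# this is where we saw the unusually high number of polling calls
strace -f -c -p "$(pidof haproxy)"

# Check the kernel ring buffer on the node for fork/OOM/seccomp hints (we found none)
dmesg | tail -n 50
```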

To Reproduce
Steps to reproduce the behavior:

  1. Deploy EKS 1.29 cluster
  2. Deploy worker nodes with AL2023-EKS-1.29-20240415
  3. Using helm, install the Percona operator and the MySQL/Galera database: https://docs.percona.com/percona-operator-for-mysql/pxc/helm.html (a sketch of the commands follows this list)
  4. Observe the behaviour inside the haproxy containers:
  • the proxy layer never becomes ready; it keeps failing
  • high CPU load
  • reports that the external check command timed out
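A minimal sketch of step 3, assuming the chart names from the linked Percona documentation; release names and namespace are arbitrary:

```sh
# Add the Percona chart repository and install the operator plus a PXC (MySQL/Galera) cluster
helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update

helm install my-op percona/pxc-operator --namespace pxc --create-namespace
helm install my-db percona/pxc-db --namespace pxc
```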

We tried changing the external check command to a plain /bin/true, but that didn't help, which ruled out the specific check script as the cause.
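One quick way to double-check this outside haproxy's own fork path is to time the command by hand inside the container; the pod and container names below are illustrative:

```sh
# Run the trivial check manually inside the haproxy container and time it,
# to see whether the slowness is tied to haproxy forking the check or to the container itself
kubectl exec -it cluster1-haproxy-0 -c haproxy -- /bin/sh -c 'time /bin/true'
```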

When we switch to Bottlerocket or AL2 and redeploy exactly the same workload, everything works. We verified by switching back, and the problems immediately returned. We tried various machine types (same issue) and various versions of the operator/database helm chart using different images (same result).

Expected behavior
A stable environment with all systems online and working, and no restarts, just as we see with Bottlerocket or AL2.

I am unsure if it is related, but I have experienced odd timeouts with EKS and AL2023 nodes as well. Under high load we see networking timeouts just trying to reach the kubernetes service from pods, as well as when fetching logs from pods. Reverting to AL2 resolved these issues completely for us.
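Something along these lines is enough to see the timeouts on our side; the pod names and image are illustrative:

```sh
# From inside the cluster: time a request to the in-cluster kubernetes service
# (a 401/403 response is fine here, we only care about connectivity and latency)
kubectl run net-test --image=curlimages/curl --restart=Never --rm -it -- \
  curl -sk -o /dev/null -w 'code=%{http_code} total=%{time_total}s\n' \
  https://kubernetes.default.svc/healthz

# Time fetching logs from a pod scheduled on an affected node (pod name is a placeholder)
time kubectl logs some-busy-pod --tail=100
```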

I brought this to AWS Support for EKS and they haven't been able to find anything of note.

Any updates here? Having to revert to AL2 isn't a viable long-term solution.

Can you specify the Node AMI version that displayed this behavior?

All I have is "AL2023-EKS-1.29-20240415".

@wonko - I launched a curl container inside a pod and had it communicate with the API server repeatedly (more than 100 times). We didn't see any connectivity issues on AL2023 with AMI 1.29.3-20240625.
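A rough sketch of that test, with the pod name, image, and loop count as assumptions rather than the exact setup:

```sh
# Start a long-lived curl pod, wait for it, then hit the API server from inside it repeatedly
kubectl run curl-test --image=curlimages/curl --restart=Never --command -- sleep 3600
kubectl wait --for=condition=Ready pod/curl-test

for i in $(seq 1 100); do
  kubectl exec curl-test -- curl -sk -o /dev/null \
    -w "attempt=$i code=%{http_code} time=%{time_total}s\n" \
    https://kubernetes.default.svc/healthz
done
```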

Is it perhaps the kind of client that is playing a part here?