kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/

Upgrade to latest liveness-probe v2.12.0 for inclusion of bug fix

bboerst opened this issue

/kind bug

What happened?
We're experiencing intermittent crashes of ebs-csi-node pods, which I believe are caused by a bug that the liveness-probe maintainers fixed in their latest release, v2.12.0 (the release notes mention the bugfix).

What you expected to happen?
It would be great if the maintainers of the aws-ebs-csi-driver helm chart could test and bump to this new release of liveness-probe, here: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/v1.27.0/charts/aws-ebs-csi-driver/values.yaml#L88

How to reproduce it (as minimally and precisely as possible)?
This issue is very intermittent. We sporadically see pods restarting shortly after starting on a new node, with events like:

  Warning  Unhealthy  9m22s (x4 over 9m52s)  kubelet            Liveness probe failed: Get "http://10.144.71.143:9808/healthz": dial tcp 10.144.71.143:9808: connect: connection refused
  Warning  BackOff    9m12s (x3 over 9m39s)  kubelet            Back-off restarting failed container liveness-probe in pod ebs-csi-node-gq7x9_kube-system(0e7dcb87-b667-45b5-a188-10a4af844d39)

Often the liveness-probe logs will show:

17969 connection.go:183] Still connecting to unix:///csi/csi.sock
17969 connection.go:183] Still connecting to unix:///csi/csi.sock
17969 connection.go:183] Still connecting to unix:///csi/csi.sock
17969 main.go:146] failed to establish connection to CSI driver: context deadline exceeded

Anything else we need to know?:
This GH issue explains the history, specifically a new timeout introduced in the upstream csi-lib-utils dependency that now has a downstream impact on liveness-probe: kubernetes-csi/livenessprobe#236

Environment

  • Kubernetes version (use kubectl version): v1.26.9
  • Driver version: 1.24.1

Hi @bboerst, thanks for bringing up this livenessprobe issue and mentioning the fix.

We bump kubernetes-csi sidecar images like the liveness probe right before our monthly release. In other words, v2.28.0 of our helm chart will bump livenessprobe to v2.12.0 (specifically public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.12.0-eks-1-29-4).
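
Until a chart release carries the bump, one possible stopgap is to override the liveness-probe image through Helm values. The fragment below is a minimal sketch, not a tested configuration: it assumes the chart exposes the sidecar image under `sidecars.livenessProbe.image`, which should be checked against the `values.yaml` of the chart version in use. The repository and tag are the ones quoted above.

```yaml
# values-override.yaml (hypothetical file name)
# Assumption: the key path sidecars.livenessProbe.image matches the chart's
# values.yaml; verify against the chart version you actually deploy.
sidecars:
  livenessProbe:
    image:
      repository: public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe
      tag: "v2.12.0-eks-1-29-4"
```

A file like this could be passed to `helm upgrade` with `-f`. The override should be dropped once the chart itself ships v2.12.0, so later sidecar bumps are not pinned back.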

We're planning to start this release next week. I'll update this issue with a link to the dependency bump PR when that is created and also let you know when we have released a new version of our helm chart.

I hope this plan is satisfactory for you. Thanks again and have a great day!

/close

The latest version of the EBS CSI Driver helm chart (v2.28.0 using driver version v1.28.0) and Kustomize manifests (release-1.28 branch) use sidecar version v2.12.0 which includes the fix.

For customers of the EKS Managed Addon version of the driver, we expect a release to be finalized and available in all commercial regions by end of week.

@ConnorJC3: Closing this issue.
