kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle

Home Page: https://cluster-api.sigs.k8s.io

MachineHealthCheck unable to remediate unreachable node with volumes attached

mjlshen opened this issue

What steps did you take and what happened?

  • Create a CAPA cluster with at least one machine/node
  • Apply a machinehealthcheck that attempts to remediate machines when nodes stop reporting status:

```yaml
spec:
  maxUnhealthy: 2
  unhealthyConditions:
  - status: Unknown
    timeout: 8m0s
    type: Ready
```
  • Run a pod on the cluster that mounts a persistent volume
  • Stop the underlying EC2 instance in AWS
  • Observe that the DrainingSucceeded status condition on the machine reports status: "True" after the drain's skipWaitForDelete timeout is exceeded:

```go
if noderefutil.IsNodeUnreachable(node) {
	// When the node is unreachable and some pods are not evicted for as long as this timeout, we ignore them.
	drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes
}
```
  • The machine is then stuck in a deleting state forever because the volume is not detached (see the sketch after this list)
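
A minimal sketch of why the deletion blocks, assuming the machine controller waits after a successful drain for the Node to stop reporting attached volumes (the function name and exact check here are illustrative, not the actual cluster-api code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// shouldWaitForVolumeDetach is a hypothetical stand-in for the post-drain
// check: infrastructure deletion does not proceed while the Node still
// reports attached volumes in its status.
func shouldWaitForVolumeDetach(node *corev1.Node) bool {
	// With the EC2 instance stopped, nothing ever confirms detachment, so
	// VolumesAttached stays non-empty and the Machine remains in Deleting.
	return len(node.Status.VolumesAttached) > 0
}
```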

What did you expect to happen?

When a machinehealthcheck remediates a machine whose underlying EC2 instance has been stopped, I expect it to successfully drain the node and replace the machine.

Cluster API version

1.7.1

Kubernetes version

v1.27.13+e709aa5

Anything else you would like to add?

I believe we can address this by setting GracePeriodSeconds: 1, as OpenShift's machinehealthcheck controller does, because for unreachable nodes, deleting pods with a specified grace period will allow for successful volume detachment.
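
A minimal sketch of the proposed change, assuming the drainer is the Helper from k8s.io/kubectl/pkg/drain as in the snippet above; the wrapper function and the unreachable check are illustrative, not the existing cluster-api code:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubectl/pkg/drain"
)

// configureDrainerForUnreachableNode shows where the proposed setting would
// sit relative to the existing skipWaitForDelete behavior.
func configureDrainerForUnreachableNode(drainer *drain.Helper, node *corev1.Node) {
	if isNodeUnreachable(node) {
		// Existing behavior: stop waiting for pods that are not evicted
		// within this timeout on an unreachable node.
		drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes

		// Proposed: delete those pods with a short grace period so they are
		// removed from the API server and the attach/detach controller can
		// detach their volumes, letting the Machine deletion complete.
		drainer.GracePeriodSeconds = 1
	}
}

// isNodeUnreachable mirrors the noderefutil check: an unreachable kubelet
// shows up as a Ready condition with status Unknown.
func isNodeUnreachable(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionUnknown
		}
	}
	return false
}
```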

Label(s) to be applied

/kind bug
/area machine

/triage accepted

/assign