kubernetes-sigs / cluster-api

What steps did you take and what happened?

Create a CAPA cluster with at least one machine/node
Apply a MachineHealthCheck that remediates machines when their nodes stop reporting status:

spec:
  maxUnhealthy: 2
  unhealthyConditions:
  - status: Unknown
    timeout: 8m0s
    type: Ready

Stop the underlying EC2 instance in AWS
The machine is marked unhealthy after 8 minutes, but draining the node then takes a long time.

What did you expect to happen?

Once the MHC determines that the node is unhealthy and starts replacing it, the drain should skip waiting for graceful pod termination and delete the pods as soon as possible.

Cluster API version

1.17.1

Kubernetes version

v1.27.13+e709aa5

Anything else you would like to add?

When a node is unreachable, the drain still honors each pod's termination grace period and only stops waiting for pods to be deleted after 5 minutes:

cluster-api/internal/controllers/machine/machine_controller.go

Line 674 in 0539a29

drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes

We should align the behavior with OpenShift MAPI: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171
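
To make the request concrete, here is a minimal, self-contained sketch of the difference between the current and the requested drain settings for an unreachable node. The `drainSettings` struct and both functions are hypothetical stand-ins, not the real kubectl drain helper or the controller code; only the field names and the 5-minute value come from the snippet above. The 1-second values in `proposedSettings` are illustrative assumptions and would need to be checked against the linked MAPI code.

```go
package main

import "fmt"

// drainSettings is a hypothetical stand-in for the two drain helper
// fields discussed in this issue. It is not the real kubectl type.
type drainSettings struct {
	// GracePeriodSeconds overrides each pod's termination grace period.
	// A negative value means "use the grace period set on the pod".
	GracePeriodSeconds int
	// SkipWaitForDeleteTimeoutSeconds stops the drain from waiting on pods
	// that have been marked for deletion for longer than this many seconds.
	// Zero means "never skip".
	SkipWaitForDeleteTimeoutSeconds int
}

// currentSettings mirrors the behavior described in this issue: pod grace
// periods are always honored, and stuck pods are skipped after 5 minutes
// when the node is unreachable.
func currentSettings(nodeUnreachable bool) drainSettings {
	s := drainSettings{GracePeriodSeconds: -1}
	if nodeUnreachable {
		s.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes
	}
	return s
}

// proposedSettings sketches the requested behavior: on an unreachable node
// the kubelet cannot terminate pods gracefully, so the drain does not wait.
// The 1-second values are assumptions chosen for illustration.
func proposedSettings(nodeUnreachable bool) drainSettings {
	s := drainSettings{GracePeriodSeconds: -1}
	if nodeUnreachable {
		s.GracePeriodSeconds = 1
		s.SkipWaitForDeleteTimeoutSeconds = 1
	}
	return s
}

func main() {
	fmt.Printf("current:  %+v\n", currentSettings(true))
	fmt.Printf("proposed: %+v\n", proposedSettings(true))
}
```

Reachable nodes are unaffected in this sketch: both functions return the same settings when `nodeUnreachable` is false, so graceful termination is only shortened when the kubelet cannot act on it anyway.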

Label(s) to be applied

/kind bug
/area machine

/assign

/triage accepted