Delayed MHC replacement of unreachable nodes
typeid opened this issue · comments
typeid commented
What steps did you take and what happened?
- Create a CAPA cluster with at least one machine/node
- Apply a machinehealthcheck that attempts to remediate machines when nodes stop reporting status
```yaml
spec:
  maxUnhealthy: 2
  unhealthyConditions:
  - status: Unknown
    timeout: 8m0s
    type: Ready
```
- Stop the underlying EC2 instance in AWS
- The machine is marked as unhealthy after 8 minutes, but draining the node takes a long time
What did you expect to happen?
Once the MHC determines the node is unhealthy and attempts to replace it, it should skip waiting for graceful pod termination and delete the pods as soon as possible.
Cluster API version
1.17.1
Kubernetes version
v1.27.13+e709aa5
Anything else you would like to add?
When a node is unreachable, the drain still honors each pod's termination grace period; it only stops waiting for pod deletion after 5 minutes.
We should align the behavior with OpenShift MAPI: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171
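The MAPI behavior linked above can be sketched as a decision over the drain parameters: when the node is unreachable, the kubelet cannot terminate pods anyway, so waiting out per-pod grace periods only delays remediation. The snippet below is an illustrative sketch, not the upstream implementation; the `drainSettings` struct and `settingsFor` function are hypothetical names, though the field names mirror the knobs exposed by kubectl's drain helper (`GracePeriodSeconds`, `SkipWaitForDeleteTimeoutSeconds`).

```go
package main

import (
	"fmt"
	"time"
)

// drainSettings models the drain knobs relevant here. The field names
// mirror kubectl's drain helper, but this struct is illustrative only.
type drainSettings struct {
	// GracePeriodSeconds: -1 means "respect each pod's own
	// terminationGracePeriodSeconds".
	GracePeriodSeconds int
	// SkipWaitForDeleteTimeoutSeconds: if a pod's deletion timestamp is
	// older than this many seconds, stop waiting for it to disappear.
	SkipWaitForDeleteTimeoutSeconds int
	// Timeout bounds the overall drain attempt.
	Timeout time.Duration
}

// settingsFor picks drain settings based on node reachability. For an
// unreachable node, force a minimal grace period and stop waiting for
// deletion almost immediately, since graceful termination cannot happen.
func settingsFor(nodeReachable bool) drainSettings {
	if nodeReachable {
		// Normal drain: honor pod grace periods, wait for deletion.
		return drainSettings{
			GracePeriodSeconds:              -1,
			SkipWaitForDeleteTimeoutSeconds: 0,
			Timeout:                         20 * time.Minute,
		}
	}
	// Unreachable node: delete pods with a 1s grace period and do not
	// block on the kubelet confirming termination.
	return drainSettings{
		GracePeriodSeconds:              1,
		SkipWaitForDeleteTimeoutSeconds: 1,
		Timeout:                         1 * time.Minute,
	}
}

func main() {
	s := settingsFor(false)
	fmt.Println(s.GracePeriodSeconds, s.SkipWaitForDeleteTimeoutSeconds)
}
```

Under this sketch, the MHC remediation path would detect the unreachable node condition (e.g. `Ready` status `Unknown`) and select the aggressive settings before starting the drain.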
Label(s) to be applied
/kind bug
/area machine
typeid commented
/assign
Alberto García Lamela commented
/triage accepted