MachineHealthCheck unable to remediate unreachable node with volumes attached
What steps did you take and what happened?
- Create a CAPA cluster with at least one machine/node
- Apply a MachineHealthCheck that attempts to remediate machines when their nodes stop reporting status:

```yaml
spec:
  maxUnhealthy: 2
  unhealthyConditions:
  - status: Unknown
    timeout: 8m0s
    type: Ready
```
- Run a pod on the cluster that mounts a persistent volume
- Stop the underlying EC2 instance in AWS
- Observe that the `DrainingSucceeded` status condition on the machine reports `status: "True"` once the `skipWaitForDelete` timeout during the drain is exceeded (`cluster-api/internal/controllers/machine/machine_controller.go`, lines 672 to 675 at `a2b7dd1`; see the sketch after this list)
- The machine is then stuck in a deleting state forever because the volume is not detached
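
To make the failure mode concrete, here is a paraphrased sketch (not the verbatim CAPI source) of the drain-helper configuration at the lines referenced above. It assumes the `drain.Helper` from `k8s.io/kubectl/pkg/drain`, which the machine controller uses; `nodeIsUnreachable` is a hypothetical helper for illustration:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubectl/pkg/drain"
)

// nodeIsUnreachable is a hypothetical helper: treat a node whose Ready
// condition is Unknown (the kubelet stopped reporting) as unreachable.
func nodeIsUnreachable(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionUnknown
		}
	}
	return false
}

// configureDrainer mirrors the behavior described above: for unreachable
// nodes the controller stops waiting for pod deletion after a timeout, so
// the drain is reported successful even though the pods (and therefore
// their volume attachments) still exist.
func configureDrainer(drainer *drain.Helper, node *corev1.Node) {
	if nodeIsUnreachable(node) {
		drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes
	}
}
```

This is why `DrainingSucceeded` flips to `"True"` while the volume remains attached: the drain gives up waiting rather than actually removing the pods.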
What did you expect to happen?
When a MachineHealthCheck remediates a machine whose underlying EC2 instance is stopped, I expect it to successfully drain the node and replace the machine.
Cluster API version
1.7.1
Kubernetes version
v1.27.13+e709aa5
Anything else you would like to add?
I believe we can address this by setting `GracePeriodSeconds: 1` on the drain helper, like OpenShift's machinehealthcheck controller does, because for unreachable nodes, deleting pods with a specified grace period allows volume detachment to succeed:
- OpenShift: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171
- CAPI: `cluster-api/internal/controllers/machine/machine_controller.go`, lines 672 to 675 at `a2b7dd1`
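
A minimal sketch of the proposed change, under the same assumptions (and reusing the hypothetical `nodeIsUnreachable` helper) as the sketch above; only the `GracePeriodSeconds` line is new relative to the current logic:

```go
// Sketch of the proposed fix, mirroring the OpenShift drain controller
// linked above. GracePeriodSeconds is a real field on drain.Helper from
// k8s.io/kubectl/pkg/drain; the surrounding function is illustrative.
func configureDrainerWithFix(drainer *drain.Helper, node *corev1.Node) {
	if nodeIsUnreachable(node) {
		// Existing behavior: stop waiting for pod deletion after 5 minutes.
		drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5
		// Proposed addition: delete pods with a 1-second grace period.
		// Graceful termination can never complete on an unreachable kubelet,
		// but per the reasoning above, a non-force delete with a short grace
		// period lets the pods be removed so volume detachment (and the
		// machine deletion) can proceed.
		drainer.GracePeriodSeconds = 1
	}
}
```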
Label(s) to be applied
/kind bug
/area machine
/triage accepted
/assign