kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle

Home Page: https://cluster-api.sigs.k8s.io

Deprecate MachineHealthCheck MaxUnhealthy and UnhealthyRange

vincepri opened this issue · comments

MachineHealthCheck currently exposes these fields:

	// Any further remediation is only allowed if at most "MaxUnhealthy" machines selected by
	// "selector" are not healthy.
	// +optional
	MaxUnhealthy *intstr.IntOrString `json:"maxUnhealthy,omitempty"`

	// Any further remediation is only allowed if the number of machines selected by "selector" as not healthy
	// is within the range of "UnhealthyRange". Takes precedence over MaxUnhealthy.
	// Eg. "[3-5]" - This means that remediation will be allowed only when:
	// (a) there are at least 3 unhealthy machines (and)
	// (b) there are at most 5 unhealthy machines
	// +optional
	// +kubebuilder:validation:Pattern=^\[[0-9]+-[0-9]+\]$
	UnhealthyRange *string `json:"unhealthyRange,omitempty"`
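For illustration, the `"[3-5]"` syntax accepted by `UnhealthyRange` can be parsed and checked as sketched below. This is a standalone sketch, not the actual CAPI code; the function names are illustrative, and the regular expression simply mirrors the kubebuilder validation pattern above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// unhealthyRangePattern mirrors the kubebuilder validation pattern on the field.
var unhealthyRangePattern = regexp.MustCompile(`^\[([0-9]+)-([0-9]+)\]$`)

// parseUnhealthyRange extracts the min/max bounds from a value such as "[3-5]".
func parseUnhealthyRange(s string) (min, max int, err error) {
	m := unhealthyRangePattern.FindStringSubmatch(s)
	if m == nil {
		return 0, 0, fmt.Errorf("invalid unhealthyRange %q", s)
	}
	min, _ = strconv.Atoi(m[1])
	max, _ = strconv.Atoi(m[2])
	if min > max {
		return 0, 0, fmt.Errorf("min %d is greater than max %d", min, max)
	}
	return min, max, nil
}

// remediationAllowedByRange reports whether the number of unhealthy machines
// falls inside the configured range, i.e. whether remediation may proceed.
func remediationAllowedByRange(unhealthy int, rangeSpec string) (bool, error) {
	min, max, err := parseUnhealthyRange(rangeSpec)
	if err != nil {
		return false, err
	}
	return unhealthy >= min && unhealthy <= max, nil
}

func main() {
	for _, n := range []int{2, 3, 5, 6} {
		ok, _ := remediationAllowedByRange(n, "[3-5]")
		fmt.Printf("unhealthy=%d allowed=%v\n", n, ok)
	}
}
```

With `"[3-5]"`, remediation is allowed for 3, 4, or 5 unhealthy machines and short-circuited otherwise, matching the field's doc comment.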

At first glance, these fields seem to control remediation, and the comments suggest as much; in reality they only control when the condition is set, not when health checks fail and remediation should occur. This behavior is confusing to most users and counterintuitive at best.

For example:

Let's say I have 10 machines in my cluster, and 5 become unhealthy for some reason. If the knobs above are set to allow, say, only 20% (or 2 machines), the MachineHealthCheck stops setting the condition after 2 machines have been marked, leaving the rest untouched. In reality, 5 machines are unhealthy, but only 2 are marked as such.
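To make the scenario above concrete, here is a rough sketch of the short-circuit check, assuming `maxUnhealthy` is either an absolute count or a percentage. The names are illustrative; the real controller resolves `intstr.IntOrString` values with Kubernetes helper functions rather than a hand-rolled parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxUnhealthyThreshold resolves a maxUnhealthy value, which may be an
// absolute count ("2") or a percentage ("20%"), against the total number
// of machines selected by the health check.
func maxUnhealthyThreshold(maxUnhealthy string, total int) (int, error) {
	if strings.HasSuffix(maxUnhealthy, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxUnhealthy, "%"))
		if err != nil {
			return 0, err
		}
		return total * pct / 100, nil
	}
	return strconv.Atoi(maxUnhealthy)
}

// remediationAllowed reports whether the MHC may keep marking machines:
// once more than the threshold are unhealthy, it short-circuits.
func remediationAllowed(unhealthy, total int, maxUnhealthy string) (bool, error) {
	threshold, err := maxUnhealthyThreshold(maxUnhealthy, total)
	if err != nil {
		return false, err
	}
	return unhealthy <= threshold, nil
}

func main() {
	// The scenario above: 10 machines, 5 unhealthy, maxUnhealthy at 20% (= 2 machines).
	ok, _ := remediationAllowed(5, 10, "20%")
	fmt.Println(ok) // false: short-circuited even though 5 machines are unhealthy
}
```

This is exactly the surprise the issue describes: the check gates marking, so once the budget is exceeded nothing further is marked, even though more machines are actually unhealthy.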

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


/area machinehealthcheck

+1 having configuration that restricts how many machines can be marked as unhealthy seems strange. Instead we should have config restricting remediation (like the recently introduced remediation strategy on MD)

Agreed. The current API, while it has served well, is more reminiscent of the original MHC design, where default remediation was built into the MHC itself.
That said, to make sense of upcoming changes/deprecations, I think we'll need to collect community feedback to check whether there's any valid scenario where scalable resources are decoupled from the short-circuiting remediation group as enabled by the current API, e.g. one MachineSet per AZ but a single common maxUnavailable budget shared across them.

Reporting an interesting comment from #10853

My only concern before actually dropping them in the future is to make sure we don't break any consumer where scalable resources are decoupled from the short-circuiting remediation group as these fields enable today, e.g. one MachineSet per AZ but a single common maxUnavailable budget shared across them.

That's an interesting note.
Maybe what we need is to be more explicit about the fact that those fields are about short-circuiting remediation (e.g. by moving them into a sub-struct and amending the documentation).

Considering that in both cases (removal or renaming) we need to go through deprecation of the current fields