kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle

Home Page: https://cluster-api.sigs.k8s.io

Deprecate MachineHealthCheck MaxUnhealthy and UnhealthyRange

vincepri opened this issue · comments

MachineHealthCheck currently exposes these fields:

	// Any further remediation is only allowed if at most "MaxUnhealthy" machines selected by
	// "selector" are not healthy.
	// +optional
	MaxUnhealthy *intstr.IntOrString `json:"maxUnhealthy,omitempty"`

	// Any further remediation is only allowed if the number of machines selected by "selector" as not healthy
	// is within the range of "UnhealthyRange". Takes precedence over MaxUnhealthy.
	// Eg. "[3-5]" - This means that remediation will be allowed only when:
	// (a) there are at least 3 unhealthy machines (and)
	// (b) there are at most 5 unhealthy machines
	// +optional
	// +kubebuilder:validation:Pattern=^\[[0-9]+-[0-9]+\]$
	UnhealthyRange *string `json:"unhealthyRange,omitempty"`
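For illustration, the `"[3-5]"` syntax accepted by `UnhealthyRange` can be parsed and checked as sketched below. This is a standalone sketch, not the actual CAPI code; the function names are illustrative, and the regular expression simply mirrors the kubebuilder validation pattern above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// unhealthyRangePattern mirrors the kubebuilder validation pattern on the field.
var unhealthyRangePattern = regexp.MustCompile(`^\[([0-9]+)-([0-9]+)\]$`)

// parseUnhealthyRange extracts the min/max bounds from a value such as "[3-5]".
func parseUnhealthyRange(s string) (min, max int, err error) {
	m := unhealthyRangePattern.FindStringSubmatch(s)
	if m == nil {
		return 0, 0, fmt.Errorf("invalid unhealthyRange %q", s)
	}
	min, _ = strconv.Atoi(m[1])
	max, _ = strconv.Atoi(m[2])
	if min > max {
		return 0, 0, fmt.Errorf("min %d is greater than max %d", min, max)
	}
	return min, max, nil
}

// remediationAllowedByRange reports whether the number of unhealthy machines
// falls inside the configured range, i.e. whether remediation may proceed.
func remediationAllowedByRange(unhealthy int, rangeSpec string) (bool, error) {
	min, max, err := parseUnhealthyRange(rangeSpec)
	if err != nil {
		return false, err
	}
	return unhealthy >= min && unhealthy <= max, nil
}

func main() {
	for _, n := range []int{2, 3, 5, 6} {
		ok, _ := remediationAllowedByRange(n, "[3-5]")
		fmt.Printf("unhealthy=%d allowed=%v\n", n, ok)
	}
}
```

With `"[3-5]"`, remediation is allowed for 3, 4, or 5 unhealthy machines and short-circuited otherwise, matching the field's doc comment.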

At first glance, these fields seem to control remediation, and the comments suggest as much; in reality they only control when the condition is set, not when health checks fail and remediation should occur. This behavior is confusing to most users and counterintuitive at best.

For example:

Let's say I have 10 machines in my cluster, and 5 become unhealthy for some reason. If the knobs above are set to allow, say, only 20% (or 2 machines), the MachineHealthCheck stops setting the condition after 2 machines have been marked, leaving the rest untouched. In reality, 5 machines are unhealthy, but only 2 are marked as such.
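To make the scenario above concrete, here is a rough sketch of the short-circuit check, assuming `maxUnhealthy` is either an absolute count or a percentage. The names are illustrative; the real controller resolves `intstr.IntOrString` values with Kubernetes helper functions rather than a hand-rolled parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxUnhealthyThreshold resolves a maxUnhealthy value, which may be an
// absolute count ("2") or a percentage ("20%"), against the total number
// of machines selected by the health check.
func maxUnhealthyThreshold(maxUnhealthy string, total int) (int, error) {
	if strings.HasSuffix(maxUnhealthy, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxUnhealthy, "%"))
		if err != nil {
			return 0, err
		}
		return total * pct / 100, nil
	}
	return strconv.Atoi(maxUnhealthy)
}

// remediationAllowed reports whether the MHC may keep marking machines:
// once more than the threshold are unhealthy, it short-circuits.
func remediationAllowed(unhealthy, total int, maxUnhealthy string) (bool, error) {
	threshold, err := maxUnhealthyThreshold(maxUnhealthy, total)
	if err != nil {
		return false, err
	}
	return unhealthy <= threshold, nil
}

func main() {
	// The scenario above: 10 machines, 5 unhealthy, maxUnhealthy at 20% (= 2 machines).
	ok, _ := remediationAllowed(5, 10, "20%")
	fmt.Println(ok) // false: short-circuited even though 5 machines are unhealthy
}
```

This is exactly the surprise the issue describes: the check gates marking, so once the budget is exceeded nothing further is marked, even though more machines are actually unhealthy.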

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


/area machinehealthcheck

+1 having configuration that restricts how many machines can be marked as unhealthy seems strange. Instead we should have config restricting remediation (like the recently introduced remediation strategy on MD)

Agreed. The current API, while it has served well, is more reminiscent of the original MHC design, where default remediation was built into the MHC itself.
That said, to make sense of upcoming changes/deprecations, I think we'll need to collect community feedback to check whether there's any valid scenario where scalable resources are decoupled from the short-circuiting remediation group as enabled by the current API, e.g. one MachineSet per AZ but a single common maxUnavailable budget shared across them.

Reporting an interesting comment from #10853

My only concern before actually dropping them in the future is to make sure we don't break any consumer where scalable resources are decoupled from the short-circuiting remediation group as these fields enable today, e.g. one MachineSet per AZ but a single common maxUnavailable budget shared across them.

That's an interesting note.
Maybe what we need is to be more explicit about the fact that those fields are about short-circuiting remediation (e.g. by moving them into a sub-struct and amending the documentation).

Considering that in both cases (removal or renaming) we need to go through deprecation of the current fields