estafette / estafette-gke-preemptible-killer

Kubernetes controller to spread preemption for preemptible VMs in GKE to avoid mass deletion after 24 hours

Home Page: https://helm.estafette.io

Minimum time between kills

JohnPaton opened this issue

It would be nice to have an option for the minimum time between node expirations. Obviously if you have many many nodes then this isn't feasible, but if you only have a handful then it can be quite disruptive to kill several within a few minutes of each other (avoiding this was actually my initial reason for looking into this project).

Maybe something like MINIMUM_KILL_INTERVAL?

I could work on this. It might have some overlap with the blacklist feature. @etiennetremel, would you accept a pull request?

The random component of the kill times is chosen between 0 and 12h. The way I would implement this is to decompose this 12h interval into equal active intervals, one per node, each pair separated by an inactive interval equal to MINIMUM_KILL_INTERVAL.

Obviously if you have many many nodes then this isn't feasible

To solve this, I would put an upper limit on the interval. How much we limit it has a direct impact on the randomness of the kill times, which is the whole point of this tool.

The following formula applies:

12 == m * (n - 1) + a * n

Where:

  • n is node count
  • m is minimum kill interval in hours, can be fractional
  • a is active kill interval in hours, can be fractional

Since node count is predetermined at runtime, I would like to put a limit on either a or m.

a == (12 - m * (n - 1)) / n
m == (12 - a * n) / (n - 1)

(edge cases not dealt with, ignore that, I don't have the brainpower right now to find the perfect formula)

Either a >= x or m <= y, where x and y can be chosen arbitrarily by us. I wouldn't impose both conditions, since either one alone is enough.

I would impose a >= x with x == 1 [hour]. That would mean:

  • n <= 1: not applicable
  • n = 2: m < 10
  • n = 3: m < 4.5
  • n = 4: m < 2.66
  • n = 5: m < 1.75
  • n = 6: m < 1.2
  • n = 7: m < 0.83
  • n = 8: m < 0.57
  • n = 9: m < 0.375
  • n = 10: m < 0.22
  • n = 11: m < 0.1
  • n >= 12: m == 0

If this condition is not respected, an error will be shown saying Minimum kill interval is higher than allowed $((12 - a * n) / (n - 1)) (formula adjusted for edge cases), you have to allow at least one hour between node kills.
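
For concreteness, here is a rough Go sketch of the arithmetic above. The helper names (activeInterval, validate, killOffset), the example values, and the exact error wording are my own illustration rather than anything in the existing controller; it assumes the 12h window and the proposed x == 1 [hour] limit.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// window is the 12h random component of the kill times discussed above.
const window = 12 * time.Hour

// activeInterval solves a == (12 - m * (n - 1)) / n from the formula above:
// the wiggle room each of n nodes keeps when consecutive kills must be at
// least m apart.
func activeInterval(n int, m time.Duration) time.Duration {
    if n <= 1 {
        return window // nothing to separate with a single node
    }
    return (window - m*time.Duration(n-1)) / time.Duration(n)
}

// validate enforces the proposed a >= x constraint with x == 1 hour; the
// error text loosely follows the message suggested above (edge cases such
// as n > 12 are not handled, matching the caveat above).
func validate(n int, m time.Duration) error {
    if n > 1 && activeInterval(n, m) < time.Hour {
        max := (window - time.Hour*time.Duration(n)) / time.Duration(n-1)
        return fmt.Errorf("minimum kill interval is higher than allowed %s, you have to allow at least one hour between node kills", max)
    }
    return nil
}

// killOffset picks a random offset inside node i's active interval, i.e. the
// decomposition of the 12h window into n active intervals separated by
// inactive intervals of length m.
func killOffset(i, n int, m time.Duration, r *rand.Rand) time.Duration {
    a := activeInterval(n, m)
    return time.Duration(i)*(a+m) + time.Duration(r.Int63n(int64(a)+1))
}

func main() {
    // 5 nodes and a 1h30m minimum interval leave a = (12h - 4*1h30m) / 5 = 1h12m per node.
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    fmt.Println(activeInterval(5, 90*time.Minute), validate(5, 90*time.Minute))
    fmt.Println(killOffset(2, 5, 90*time.Minute, r)) // somewhere between 5h24m and 6h36m
}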

My questions for anyone interested in this feature:

  • Does this explanation make sense to you?
  • Is it still feasible? E.g. if you often have more than 12 nodes in your cluster and we keep x == 1 [hour], this feature will be useless to you.
  • Is any other value of x more appropriate?
  • Should we limit m <= y (minimum kill time interval) instead?
  • Or maybe another algorithm is better?

Nice! A few comments:

Since node count is predetermined at runtime

This is not the case with autoscaling clusters. Our number of nodes fluctuates throughout the day. Or do I misunderstand?

I would like to put a limit on either a or m

Why is this necessary? Wouldn't it be easier to allow any minimum interval m >= 0 hours? There could just be a warning stating that if your interval is too long given the number of nodes you have, it will (actual behavior TBD):

  • Not be respected
  • Cause nodes to not be killed beyond the 24h guarantee
  • Some other behavior

This is not the case with autoscaling clusters. Our number of nodes fluctuates throughout the day. Or do I misunderstand?

Well, the killer either runs once and exits or runs periodically with a given time period via -i | --interval. In either case, the node count is taken at the time it runs, once or periodically, so I understand that it could vary between runs, i.e. on one run it could be 4, on another it could be 8, but I was treating it as a constant to simplify the problem. It's important to take different values for this constant into account, which I tried to do in the list above.

Why is this necessary? Wouldn't it be easier to allow any minimum interval m >= 0 hours?

Consider this then:
With 7 nodes, if you want a minimum kill interval of 2h, you have precisely determined when every node will be killed: exactly 2h apart. If you think this is too much of an edge case, say you want a 1h59m minimum kill interval instead: then each node has roughly a minute of wiggle room in which it can die. The problem is that the randomness of the kill times is gone, fully or in part.
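
To make the arithmetic of this example explicit, here is a small Go check that simply plugs the formula from earlier in the thread; nothing here comes from the controller itself.

package main

import (
    "fmt"
    "time"
)

func main() {
    // a == (12 - m * (n - 1)) / n with n == 7:
    fmt.Println((12*time.Hour - 2*time.Hour*6) / 7)                // 0s: every kill time is fully determined
    fmt.Println((12*time.Hour - (time.Hour+59*time.Minute)*6) / 7) // ~51s of wiggle room left per node
}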

I am not an actual user of this tool, nor do I fully understand why killing preemptible nodes earlier than the original 24h schedule is important (I think it's for fault detection?), so I have to ask: is this precise kill timing a problem? If not, then m >= 0 is the way to go. If yes, then we should allow more wiggle room for each node, say more than 1 hour or any other value; I'm completely open to suggestions.

The good thing is that those who do specify the minimum kill interval can be made aware of this through a help entry, so it's not enforced on all users, and by default it will be 0.

Just say the word, and I'll get to work.

In either cases, the node count is taken at the time it runs

Ah okay, I understand. Then indeed it can be viewed as a constant. However, the issue with determining the constraints is still the same: if the constraint on the minimum time (a config value) is determined by the number of nodes (a value that can change), then the config can be invalidated just because the cluster has scaled up, no?

why killing preemptible nodes earlier than the original schedule of 24h is important

It's because Google will often kill them if they reach that age, so if many of your nodes were created at the same time (new application deployed), then there is a good chance they'll all get killed at that time too. This is compounded by the fact that there are also certain times of day when a lot of killing happens, and if many of your nodes are killed at that time, then tomorrow it's even more likely to happen again. Randomly killing them earlier helps keep the killing spread throughout the day.

is this precise time of killing a problem

No, imo the whole goal here is to spread the killing out as opposed to making it truly "random", though I'm not the creator so they may have different opinions. So I'd vote for m >= 0. But maybe @JorritSalverda or @etiennetremel should weigh in?

Hi, currently the implementation doesn't look at all nodes at once; when a node is seen for the first time by the controller - either by the node watcher or the fallback loop - it sets an expiry time without taking any other node into account.

Because it randomises the expiration between 12 and 24 hours, it spreads their termination, avoiding a mass termination 24 hours after a large number of nodes have been created simultaneously. Doing it this way keeps the logic extremely simple, but of course it might have some unintended side effects where multiple nodes get killed at the same time. When the controller kills a node, though, it drains the pods from that node first, so it shouldn't impact the operation of your applications too much.
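
A minimal Go sketch of that per-node behaviour, assuming minute granularity as hinted at in the improvement list below; the helper name is mine and this is not the actual implementation.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// expiryTime illustrates the behaviour described above: when a node is first
// seen, it gets an expiry somewhere between 12 and 24 hours after creation,
// chosen independently of every other node.
func expiryTime(created time.Time, r *rand.Rand) time.Time {
    offset := time.Duration(r.Intn(12*60)) * time.Minute // whole minutes within the 12h range
    return created.Add(12*time.Hour + offset)
}

func main() {
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    fmt.Println(expiryTime(time.Now(), r))
}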

However a couple of things can be improved:

  • keep a map of nodes and expiry times so they can be terminated at the exact expiry time instead of waiting for the next watch event or fallback loop to inspect the node and realise it needs to be terminated
  • use seconds instead of minutes for an expiry datetime, so it's less likely to start termination at the same time
  • respect pod disruption budgets when draining pods

Pull requests for these improvements are more than welcome!
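
A rough Go sketch of the first suggestion, keeping an in-memory map of node names to expiry times and killing at the exact expiry; the scheduler type and its methods are my own illustration, not existing controller code.

package main

import (
    "fmt"
    "time"
)

// scheduler keeps node name -> expiry time so nodes can be killed at their
// exact expiry instead of waiting for the next watch event or fallback loop.
type scheduler struct {
    expiries map[string]time.Time
}

// next returns the node with the earliest expiry, if any.
func (s *scheduler) next() (name string, at time.Time, ok bool) {
    for n, t := range s.expiries {
        if name == "" || t.Before(at) {
            name, at = n, t
        }
    }
    return name, at, name != ""
}

// run sleeps until the next expiry and invokes kill; a real implementation
// would also need to wake up when the watcher adds or removes nodes.
func (s *scheduler) run(kill func(node string)) {
    for {
        node, at, ok := s.next()
        if !ok {
            time.Sleep(time.Minute) // nothing scheduled yet
            continue
        }
        time.Sleep(time.Until(at)) // returns immediately if already past due
        kill(node)
        delete(s.expiries, node)
    }
}

func main() {
    s := &scheduler{expiries: map[string]time.Time{
        "gke-pool-node-a": time.Now().Add(2 * time.Second),
    }}
    go s.run(func(node string) { fmt.Println("draining and deleting", node) })
    time.Sleep(5 * time.Second)
}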

We rely less on preemptibles these days since in our region Google started preempting nodes much more frequently, often before this controller terminated them. And when they preempt a node GKE doesn't get notified of this and thinks pods are still running on that node for a while, leading to issues with kube-dns. For that reason we're not spending much time on improving this controller.

These days however a combination of this controller with https://github.com/GoogleCloudPlatform/k8s-node-termination-handler would be a good option, because it also ensures nodes get cleaned up in a better way in case of real preemptions.

Respecting PDBs would already be a great start indeed. I raised this issue because of a slow-starting application that was disrupted by its nodes getting killed all within the startup time, leading to an outage. We do also run the node termination handler for "real" preemptions, but this is a good way to avoid those also happening all at the same time (since the preemptions obviously don't respect PDBs either).

Done #69. Give it a spin.

Here's a stress test run.

$ ./scripts/all-in-one-test -w '12:00 - 17:00' -m '2h' -i 10

Annotations:        estafette.io/gke-preemptible-killer-state: 2020-03-03T14:31:54Z
CreationTimestamp:  Mon, 02 Mar 2020 21:34:37 +0200
Annotations:        estafette.io/gke-preemptible-killer-state: 2020-03-03T16:31:19Z
CreationTimestamp:  Mon, 02 Mar 2020 21:34:37 +0200
Annotations:        estafette.io/gke-preemptible-killer-state: 2020-03-03T12:27:27Z
CreationTimestamp:  Mon, 02 Mar 2020 21:34:37 +0200

Hi @andrei-pavel, @JohnPaton, sorry for being slow to respond to this issue and the PR. Parking something to look at later sometimes leads to it escaping my attention fully :|

I think a lot of the weird workarounds we're trying to add to this controller could be avoided by making the application PDB-aware instead. What do you think? If a node comes up for preemption, we would have to check all PDBs related to the applications running on that node and wait for any of them that don't allow disruptions (probably up to 45 minutes or so, just like most other PDB-aware parts of Kubernetes).
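
For illustration, a PDB-aware drain could go through the Eviction API, which the API server already gates on PodDisruptionBudgets; the sketch below makes that assumption and is not part of the current controller.

package drain

import (
    "context"
    "time"

    policyv1 "k8s.io/api/policy/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// evictRespectingPDB asks the API server to evict a pod instead of deleting it
// directly. Evictions are rejected with 429 TooManyRequests while the pod's
// PodDisruptionBudget allows no further disruptions, so retry for up to the
// ~45 minutes mentioned above.
func evictRespectingPDB(ctx context.Context, client kubernetes.Interface, namespace, pod string) error {
    deadline := time.Now().Add(45 * time.Minute)
    for {
        err := client.PolicyV1().Evictions(namespace).Evict(ctx, &policyv1.Eviction{
            ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: namespace},
        })
        if err == nil || apierrors.IsNotFound(err) {
            return nil
        }
        // Anything other than "blocked by a PDB right now" is fatal, as is hitting the deadline.
        if !apierrors.IsTooManyRequests(err) || time.Now().After(deadline) {
            return err
        }
        time.Sleep(10 * time.Second)
    }
}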

That would make sense to me, and it would become "automatically" interoperable with existing configs so 👍

Selecting the PDBs for each pod is done as follows in the cluster autoscaler:

https://github.com/kubernetes/autoscaler/blob/c94ade5266a6c4515aa4a0c7365af0ffb1cf8989/cluster-autoscaler/simulator/drain.go#L93-L109

Doesn't seem too efficient, but given that it's not executed very often, this would be suitable for deleting pods in this application.
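
In Go this boils down to matching each PDB's label selector against the pod's labels within the same namespace. A rough sketch along the lines of the linked code, using the policy/v1 types (the linked code may import an older API version):

package drain

import (
    corev1 "k8s.io/api/core/v1"
    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
)

// pdbsForPod returns the PodDisruptionBudgets that apply to a pod: those in
// the same namespace whose selector matches the pod's labels.
func pdbsForPod(pod *corev1.Pod, pdbs []*policyv1.PodDisruptionBudget) ([]*policyv1.PodDisruptionBudget, error) {
    var matching []*policyv1.PodDisruptionBudget
    for _, pdb := range pdbs {
        if pdb.Namespace != pod.Namespace {
            continue
        }
        selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
        if err != nil {
            return nil, err
        }
        if selector.Matches(labels.Set(pod.Labels)) {
            matching = append(matching, pdb)
        }
    }
    return matching, nil
}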