estafette / estafette-gke-preemptible-killer

Kubernetes controller to spread preemption for preemptible VMs in GKE to avoid mass deletion after 24 hours

Home Page: https://helm.estafette.io

Disruption occurs when preemptible-killer is deployed too late

zvictor opened this issue

First of all, I would like to thank the team behind this OSS for their great work!

I would also like to mention that I have only just started trying estafette-gke-preemptible-killer, so I am not yet sure whether what I am about to describe is a bug or intended behaviour. Nonetheless, I am sharing my experience here as it might spark some constructive discussion.


I had 3 preemptible nodes that had been running for 21h. The moment I deployed estafette-gke-preemptible-killer, 2 were immediately killed, which triggered 3 new ones to be created. Most of my containers had to be recreated at the same time, so I ended up with 7 minutes of downtime for my application.

My understanding is that estafette-gke-preemptible-killer assigned death times to the nodes based purely on their creation time, which led to their death times being assigned in the past.
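To illustrate that understanding, here is a minimal sketch, not the controller's actual code. It assumes the expiry is the node's creation time plus a uniformly random offset between 12h and 24h; the function name `expiry_from_creation` and the uniform distribution are assumptions made for illustration. Under those assumptions, a node that is already 21h old gets an expiry in the past whenever the offset is 21h or less, which is 9 of the 12 hours in the window, so about three times out of four.

```python
import random
from datetime import datetime, timedelta, timezone


def expiry_from_creation(created_at, rng, min_hours=12.0, max_hours=24.0):
    """Hypothetical helper: creation time plus a uniform random offset in hours."""
    offset_hours = rng.uniform(min_hours, max_hours)
    return created_at + timedelta(hours=offset_hours)


# Timestamp of the controller start taken from the log in this issue.
now = datetime(2019, 3, 20, 12, 52, 25, tzinfo=timezone.utc)
created_at = now - timedelta(hours=21)  # the node had already been up for 21h

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [expiry_from_creation(created_at, rng) for _ in range(1000)]
already_expired = sum(1 for t in samples if t <= now)

print(f"{already_expired} of {len(samples)} sampled expiries are already in the past")
```

With several old nodes in the cluster, most of them would therefore look expired on the controller's first pass, which matches the simultaneous kills described above.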

Given that this tool's goal is to spread out preemptions to avoid the risk of all nodes getting deleted at the same time, my questions are:

  • shouldn't we base the death time on the remaining life expectancy instead of the whole life expectancy? ([max(elapsedTime, 12h), 24h] instead of [12h, 24h])

  • should we consider the death times of other nodes when defining the death time of a new node? If we want to avoid the risk of all nodes getting deleted at the same time, we should probably add extra precautions instead of relying only on random death times.
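The two suggestions above could be combined roughly as follows. This is only a sketch of the idea, not a patch for the controller: `pick_expiry`, `min_gap`, and `attempts` are hypothetical names, and the first bullet is read here as "the window starts at the later of creation + 12h and the current time", which is an interpretation. Candidates are drawn at random from the remaining window, and a candidate is accepted only if it keeps a minimum distance from the expiries already scheduled for other nodes.

```python
import random
from datetime import datetime, timedelta, timezone

MIN_LIFE = timedelta(hours=12)
MAX_LIFE = timedelta(hours=24)


def pick_expiry(created_at, now, other_expiries, rng,
                min_gap=timedelta(minutes=30), attempts=20):
    """Pick an expiry inside the node's remaining lifetime, away from other expiries."""
    earliest = max(created_at + MIN_LIFE, now)  # never schedule in the past
    latest = created_at + MAX_LIFE
    if latest <= earliest:
        # The node is at or past 24h, so there is no window left to spread over.
        return latest
    span = (latest - earliest).total_seconds()
    best, best_gap = None, None
    for _ in range(attempts):
        candidate = earliest + timedelta(seconds=rng.uniform(0, span))
        # Distance to the closest already-scheduled expiry (MAX_LIFE if there is none).
        gap = min((abs(candidate - e) for e in other_expiries), default=MAX_LIFE)
        if gap >= min_gap:
            return candidate
        if best is None or gap > best_gap:
            best, best_gap = candidate, gap
    return best  # no candidate met min_gap; fall back to the best-spaced one


now = datetime(2019, 3, 20, 12, 52, 25, tzinfo=timezone.utc)
rng = random.Random(0)
scheduled = []
for age_hours in (21, 21, 21):  # three nodes that are all 21h old
    created_at = now - timedelta(hours=age_hours)
    scheduled.append(pick_expiry(created_at, now, scheduled, rng))

for t in sorted(scheduled):
    print(t.isoformat())
```

For the three 21h-old nodes in this example, every expiry falls within the next 3 hours rather than in the past. Retrying a fixed number of times and falling back to the best-spaced candidate keeps the function from looping forever when the remaining window is too small to satisfy `min_gap` for every node.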

$ kubectl logs -l app=estafette-gke-preemptible-killer -n preemption-controller

{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","branch":"master","revision":"013c835f8b689b76a1eba2580a71d623dc343bd8","buildDate":"2017-09-12T15:51:07Z","goVersion":"go1.9","message":"Starting estafette-gke-preemptible-killer..."}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","message":"Listing all preemptible nodes for cluster..."}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","port":":9001","path":"/metrics","message":"Serving Prometheus metrics..."}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","message":"Cluster has 4 preemptible nodes"}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-default-pool-a5413e78-bnpc","message":"Annotation not found, adding estafette.io/gke-preemptible-killer-state to 2019-03-21T05:46:11Z"}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-default-pool-a5413e78-bnpc","message":"1014 minute(s) to go before kill, keeping node"}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-2-pool-e69f5d31-gmn6","message":"Annotation not found, adding estafette.io/gke-preemptible-killer-state to 2019-03-21T03:46:48Z"}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-2-pool-e69f5d31-gmn6","message":"894 minute(s) to go before kill, keeping node"}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-2-pool-e69f5d31-jgn9","message":"Annotation not found, adding estafette.io/gke-preemptible-killer-state to 2019-03-21T01:46:47Z"}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-2-pool-e69f5d31-jgn9","message":"774 minute(s) to go before kill, keeping node"}
{"time":"2019-03-20T12:52:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"Node expired -508 minute(s) ago, deleting..."}
{"time":"2019-03-20T12:52:26Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) found"}
{"time":"2019-03-20T12:52:26Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"Deleting pod estafette-gke-preemptible-killer-6779ffb67-z8m2p"}
{"time":"2019-03-20T12:52:26Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 9s"}
{"time":"2019-03-20T12:52:35Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 10s"}
{"time":"2019-03-20T12:52:45Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 9s"}
{"time":"2019-03-20T12:52:54Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 8s"}
{"time":"2019-03-20T12:53:02Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 8s"}
{"time":"2019-03-20T12:53:10Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 8s"}
{"time":"2019-03-20T12:53:18Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 10s"}
{"time":"2019-03-20T12:53:28Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 11s"}
{"time":"2019-03-20T12:53:39Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 10s"}
{"time":"2019-03-20T12:53:49Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 9s"}
{"time":"2019-03-20T12:53:58Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 8s"}
{"time":"2019-03-20T12:54:06Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 10s"}
{"time":"2019-03-20T12:54:16Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 11s"}
{"time":"2019-03-20T12:54:27Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 9s"}
{"time":"2019-03-20T12:54:36Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 9s"}
{"time":"2019-03-20T12:54:45Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 10s"}
{"time":"2019-03-20T12:54:55Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 11s"}
{"time":"2019-03-20T12:55:07Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 10s"}
{"time":"2019-03-20T12:55:17Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 8s"}
{"time":"2019-03-20T12:55:25Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 10s"}
{"time":"2019-03-20T12:55:35Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"1 pod(s) pending deletion, sleeping 11s"}
{"time":"2019-03-20T12:55:46Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"Done draining node"}
{"time":"2019-03-20T12:55:48Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","host":"gke-staging-n1-standard-8-pool-eaca26b3-wc5j","message":"Node deleted"}
{"time":"2019-03-20T12:55:48Z","severity":"info","app":"estafette-gke-preemptible-killer","version":"1.0.35","message":"Sleeping for 731 seconds..."}
$ kubectl get nodes

NAME                                           STATUS                     ROLES     AGE       VERSION
gke-staging-n1-standard-1-pool-a5413e78-bnpc   Ready                      <none>    14m       v1.12.5-gke.10
gke-staging-n1-standard-2-pool-e69f5d31-gmn6   Ready                      <none>    14m       v1.12.5-gke.10
gke-staging-n1-standard-2-pool-e69f5d31-jgn9   Ready                      <none>    14m       v1.12.5-gke.10
gke-staging-n1-standard-8-pool-eaca26b3-wc5j   Ready,SchedulingDisabled   <none>    21h       v1.12.5-gke.10

I agree, that makes total sense. Feel free to open a PR if you have spare time.