estafette / estafette-gke-preemptible-killer

Kubernetes controller to spread preemption for preemptible VMs in GKE to avoid mass deletion after 24 hours

Home page: https://helm.estafette.io

panic: sync: WaitGroup is reused before previous Wait has returned

smartyjohn opened this issue

Running version 1.1.5 (1.1.2 also showed the same behavior). The prior 1.0.x versions I've used did not have this panic in the logs. Relevant log portions:

2019-03-12T16:46:33Z "1 pod(s) pending deletion, sleeping 8s"
2019-03-12T16:46:37Z "Draining node timeout reached"
2019-03-12T16:46:37Z "0 kube-dns pod(s) found"
2019-03-12T16:46:37Z "Done draining kube-dns from node"
2019-03-12T16:46:38Z "Node deleted"
2019-03-12T16:46:38Z "322 minute(s) to go before kill, keeping node"
2019-03-12T16:46:38Z "Sleeping for 640 seconds..."
panic: sync: WaitGroup is reused before previous Wait has returned
goroutine 1 [running]:
sync.(*WaitGroup).Wait(0xc000222000)
 	/usr/local/go/src/sync/waitgroup.go:132 +0xae
main.main()
 	/estafette-work/main.go:171 +0x956
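
For context, Go raises this panic when a WaitGroup is reused — Add is called again — while a goroutine from the previous "generation" is still inside Wait. A minimal, hypothetical sketch (not this project's code) that usually reproduces the same message:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)

	// A goroutine blocks in Wait until the counter reaches zero.
	go wg.Wait()
	time.Sleep(time.Millisecond) // give it time to park inside Wait

	wg.Done() // counter hits zero; the waiter is woken up
	wg.Add(1) // reuse the WaitGroup before that Wait has returned

	// The woken waiter re-checks the WaitGroup state, finds a nonzero
	// counter, and panics: "sync: WaitGroup is reused before previous
	// Wait has returned" — the message from the logs above.
	time.Sleep(time.Second)
	fmt.Println("the race did not trigger on this run")
}
```

That would match the trace: main.main appears to Wait on a WaitGroup that a later pass of the loop re-arms with Add before the earlier Wait has returned.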

It seems to be related to the killer deleting the node it is itself running on. The rest of the logs indicate another killer process was spun up in the prior minute or two; both processes then alternate messages like "1 pod(s) pending deletion, sleeping 9s".

The newly created second killer pod (which ran 8s after the process above) logs the expected notices that the node has already been deleted:

2019-03-12T16:46:46Z "Draining node timeout reached"
2019-03-12T16:46:46Z "0 kube-dns pod(s) found"
2019-03-12T16:46:46Z "Done draining kube-dns from node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error deleting node"
2019-03-12T16:46:46Z "kubernetes api: Failure 404 nodes \"[...trimmed...]\" not found","Error while processing node"

The new pod then continues normally, and the old pod apparently dies off.

One way to fix it would be to use the Downward API to inject the node name as an environment variable, then add a condition in the kubernetes.DrainNode function so the controller never evicts itself (the estafette-gke-preemptible-killer pod). The node would still be deleted, and the pod would simply be re-scheduled on another node.
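
For illustration, the Downward API part could look like this container spec fragment (the variable names NODE_NAME and POD_NAME are assumptions, not the chart's actual values):

```yaml
# Fragment of the killer's container spec: expose the node and pod name
# to the process via the Kubernetes Downward API.
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```

The guard in kubernetes.DrainNode could then be sketched roughly as follows (a hypothetical helper; the repo's real signatures will differ):

```go
package kubernetes

import "os"

// shouldSkipEviction reports whether a pod must be left alone during a
// drain. Hypothetical sketch: it assumes NODE_NAME and POD_NAME are
// injected via the Downward API as in the spec fragment above.
func shouldSkipEviction(drainedNode, podName string) bool {
	// Skip the killer's own pod when it is draining its own node; the
	// node deletion that follows removes the pod anyway, and the
	// scheduler re-creates it on another node.
	return drainedNode == os.Getenv("NODE_NAME") &&
		podName == os.Getenv("POD_NAME")
}
```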