estafette / estafette-gke-preemptible-killer

Kubernetes controller to spread preemption for preemptible VMs in GKE to avoid mass deletion after 24 hours

Home page: https://helm.estafette.io

nodes do not get deleted

JorritSalverda opened this issue

In one of our Kubernetes Engine clusters, nodes that should be deleted do not get removed properly. They're already cordoned (disabled for scheduling) and their pods are evicted, but then the following error is logged when the controller tries to delete the VM:

{
	"time":"2017-11-20T09:23:10Z",
	"severity":"error",
	"app":"estafette-gke-preemptible-killer",
	"version":"1.0.29",
	"error":"Delete https://www.googleapis.com/compute/v1/projects/***/zones/europe-west1-c/instances/gke-development-euro-auto-scaling-pre-33198d65-gq2m?alt=json: dial tcp: i/o timeout",
	"host":"gke-development-euro-auto-scaling-pre-33198d65-gq2m",
	"message":"Error while processing node"
}

Can this be a timeout on the GCloud side? I wasn't able to see any outage during that period. If this node doesn't get processed now, it should be picked up on the next loop, and if the error still persists, the logs right before it happens may contain more information.

I'm seeing something similar happen. My guess is this might be happening because kube-dns is being killed before the GCloud client is used, so it fails to resolve the host name when authenticating.

 jsonPayload: {
  app: "estafette-gke-preemptible-killer"
  error: "Delete https://www.googleapis.com/compute/v1/projects/path/to/instance?alt=json: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: lookup oauth2.googleapis.com on 10.114.0.10:53: dial udp 10.114.0.10:53: connect: network is unreachable"
  host: "test-pool-cb8bed09-17s6"
  message: "Error deleting GCloud instance"
  version: "1.0.35"
 }
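If the failure is transient (for example kube-dns on the draining node briefly going away), simply retrying the delete call with a bit of backoff would cover it. Below is a minimal Go sketch of that idea, not the controller's actual code; it assumes default application credentials and the google.golang.org/api/compute/v1 client, and the function name deleteInstanceWithRetry is made up for illustration.

package sketch

import (
	"context"
	"log"
	"time"

	compute "google.golang.org/api/compute/v1"
)

// deleteInstanceWithRetry retries the Compute Engine delete call a few times with a
// growing backoff, so a transient DNS or network failure doesn't leave the VM behind.
func deleteInstanceWithRetry(ctx context.Context, project, zone, instance string) error {
	service, err := compute.NewService(ctx)
	if err != nil {
		return err
	}
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		if _, lastErr = service.Instances.Delete(project, zone, instance).Context(ctx).Do(); lastErr == nil {
			return nil
		}
		log.Printf("delete attempt %d for %s failed: %v", attempt, instance, lastErr)
		time.Sleep(time.Duration(attempt) * 10 * time.Second)
	}
	return lastErr
}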

Although kube-dns, if present on the node, is actively deleted by https://github.com/estafette/estafette-gke-preemptible-killer/blob/master/main.go#L296, this shouldn't be an issue since kube-dns runs in an HA setup.
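For context, that draining roughly amounts to the sketch below. This is not the exact code behind main.go#L296; it assumes a client-go clientset and uses the eviction API (the controller itself may simply delete the pods directly), and evictKubeDNSFromNode is a made-up name.

package sketch

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictKubeDNSFromNode drains kube-dns pods from the given node via the eviction API
// before the VM is removed; evictions respect PodDisruptionBudgets, so an HA kube-dns
// deployment keeps serving from replicas on other nodes.
func evictKubeDNSFromNode(ctx context.Context, client kubernetes.Interface, node string) error {
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=kube-dns",
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", node),
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}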

However, it does turn out that Kubernetes Engine, although built to be resilient, isn't very resilient in the face of preemptions. The master doesn't update Services with pods on a preempted node fast enough to stop sending traffic there. We've seen this by having frequent kube-dns issues correlating with real preemptions by Google, not the ones issued by our preemptible-killer.

@JorritSalverda We're getting DNS errors intermittently on our GKE preemptibles (with the preemptible killer installed) when services in the cluster try to resolve other services in the same cluster.
EDIT: It should be noted that we're only having these intermittent connection issues on our preemptible nodes; the other nodes have no issues.
I'm asking out of ignorance:
What is the purpose of removing kube-dns from the node?
Would leaving kube-dns on the node remove the DNS issues?
And could you clarify your last statement: "We've seen this by having frequent kube-dns issues correlating with real preemptions by Google, not the ones issued by our preemptible-killer."?

@jstephens7 we've seen the same and have actually moved away from preemptibles for the time being. It's unrelated to this controller, but happens when a node really gets preempted by Google before this controller would do it instead. GKE doesn't handle preemption gracefully; it just kills the node at once. This leaves the Kubernetes master in the blind for a while until it discovers that the node is no longer available. In the meantime the iptables rules don't get updated and traffic still gets routed to the unavailable node. I would expect this scenario to be handled better, since you want Kubernetes to be resilient in the face of real node malfunction.

For AWS there's actually a notifier that warns you a spot instance is going down, but GCP doesn't have such a thing currently. See https://learnk8s.io/blog/kubernetes-spot-instances for more info.

@JorritSalverda have you completely given up on preemptibles in production (because of this issue)? Just exploring the idea, so I would love to hear your feedback.

And would @theallseingeye's suggestion mitigate this?

When deleting a node, I am experiencing this error:

INF Done draining kube-dns from node host=gke-xxxxx
ERR Error deleting GCloud instance error="Delete "https://www.googleapis.com/compute/v1/projects/yyyyyy/zones/europe-west1-b/instances/gke-xxxxxx?alt=json\": oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token\": x509: certificate signed by unknown authority" host=gke-xxxxx
ERR Error while processing node error="Delete "https://www.googleapis.com/compute/v1/projects/yyyyyy/zones/europe-west1-b/instances/gke-xxxxxx?alt=json\": oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token\": x509: certificate signed by unknown authority" host=gke-xxxx

I would say that my service account JSON is properly uploaded to the pod and the account has the proper permissions, so I don't know what is happening.

Hi @santinoncs, do you use the Helm chart? And what version? We run it with a service account with the compute.instanceAdmin.v1 role on the project the GKE cluster is in. That seems to work fine.

Hi @tmirks, we did abandon preemptibles for a while since the pressure on europe-west1 mounted and preemptions became more commonplace. The fact that GKE wasn't aware of preemptions caused a lot of trouble with kube-dns requests getting sent to no-longer-existing pods. Now we're testing the k8s-node-termination-handler (see the Helm chart at https://github.com/estafette/k8s-node-termination-handler) together with this application, to ensure both that GKE is aware of preemptions and that preemptions are less likely to happen all at once. Spreading preemptible nodes across zones should also help reduce the chance of mass preemptions.
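For reference, what such a termination handler watches boils down to the GCE metadata server: instance/preempted flips to TRUE when the VM is being preempted, and wait_for_change=true blocks until that happens. A minimal sketch (not the chart's actual code):

package main

import (
	"io"
	"log"
	"net/http"
	"strings"
	"time"
)

// The metadata server only answers requests carrying the Metadata-Flavor: Google header.
const preemptedURL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true"

func main() {
	for {
		req, err := http.NewRequest(http.MethodGet, preemptedURL, nil)
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Metadata-Flavor", "Google")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Printf("metadata request failed: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if strings.TrimSpace(string(body)) == "TRUE" {
			log.Print("node is being preempted; start draining pods now")
			return
		}
	}
}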

Hi @santinoncs, do you use the Helm chart? And what version? We run it with a service account with the compute.instanceAdmin.v1 role on the project the GKE cluster is in. That seems to work fine.

It's working now that I copy the ca-certificates file into the container.
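That fits the error above: Go only trusts the root CAs shipped in the image, so a scratch- or distroless-style image without ca-certificates makes every HTTPS call to the Google APIs fail with exactly that x509 message. A tiny standalone probe (hypothetical, not part of the controller) makes the problem visible:

package main

import (
	"log"
	"net/http"
)

// Probe a Google TLS endpoint; without root CAs in the image this fails with
// "x509: certificate signed by unknown authority".
func main() {
	resp, err := http.Get("https://oauth2.googleapis.com/")
	if err != nil {
		log.Fatalf("TLS probe failed, check that ca-certificates is present in the image: %v", err)
	}
	defer resp.Body.Close()
	log.Printf("TLS probe succeeded with status %s", resp.Status)
}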

Just FYI, GKE now handles node preemption gracefully, giving pods about 25 seconds to shut down.
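For a workload that wants to use that window, the usual pattern is to catch SIGTERM and finish within the grace period; a generic Go sketch (not specific to this controller):

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for the SIGTERM the kubelet sends when the node is preempted, then shut
	// down well within the roughly 25-second window mentioned above.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown did not finish cleanly: %v", err)
	}
}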