StarpTech / k-andy

Low cost Kubernetes stack for startups, prototypes, and playgrounds on Hetzner Cloud.

Documentation: Proof of high availability?

pintoflager opened this issue

This must be a stupid question, but I'm going to ask it anyway: how does one test that HA actually works?

I ran the installer with defaults, applied the example hello-kubernetes manifest and ended up with:
k3s-agent-0
k3s-agent-1
k3s-control-plane-0
k3s-control-plane-1
loadbalancer --- balancing traffic between k3s-agent-0 and k3s-agent-1

Everything gets installed perfectly, and hello-kubernetes shows up at http://loadbalancer-ip:8080.

Now, when I try to cause havoc in the cluster by powering off either k3s-agent-0 or k3s-agent-1, shouldn't Kubernetes spin up the hello-kubernetes pods on the remaining agent node, and shouldn't the load balancer re-route traffic to that node so hello-kubernetes keeps showing up as usual?
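In case it helps, this is roughly how one could watch what happens while powering a node off (a sketch only; it assumes the kubeconfig written by the installer and that the example manifest created a Deployment named hello-kubernetes in the default namespace):

```sh
# Terminal 1: the node's status should flip to NotReady roughly 40 seconds
# after the agent is powered off.
kubectl get nodes --watch

# Terminal 2: shows which agent the replicas land on and when they are
# evicted and rescheduled (by default only after ~5 minutes of NotReady).
kubectl get pods -o wide --watch
```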

Reading the Kubernetes docs, I was expecting the hello-kubernetes app to stay alive:

Each Pod is tied to the Node where it is scheduled, and remains there until termination (according to restart policy) or deletion. In case of a Node failure, identical Pods are scheduled on other available Nodes in the cluster.

Instead, when I powered agent nodes off and on, the hello-kubernetes app died along with the agent node where both pod replicas (2) were scheduled.

I also noticed that both k3s-control-plane-0 and k3s-control-plane-1 have to be up for kubectl to connect to the cluster. Killing either one of the two prevents access to the cluster. This would need another load balancer, right?
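A quick way to check which endpoint kubectl talks to (a rough sketch, assuming the default context from the installer's kubeconfig):

```sh
# Print the API server address of the current context. If it is a single
# control-plane IP (e.g. https://<control-plane-ip>:6443) rather than a
# load balancer address, losing that endpoint also takes kubectl access with it.
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
```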

Thank you for your work on this project; it has helped me move forward with learning Kubernetes.

I think this is a quorum problem. https://news.ycombinator.com/item?id=11991639
Please correct me if I'm wrong.

I was testing k3s locally, and 2 servers + 2 agents fails on my laptop too.
3 servers and 2 agents work fine, though.

Hi @pintoflager, that's a very typical and important question. Honestly, I only use k-andy for playgrounds, and I haven't tested it intensively for availability. In theory, you're right: if one worker node goes down, the pod is rescheduled to the other node, and worker nodes should stay functional without the control planes. I don't maintain K3s or the Hetzner Cloud Controller, so I recommend asking the question in those repositories or Slack channels.

Instead, when I powered agent nodes off and on, the hello-kubernetes app died along with the agent node where both pod replicas (2) were scheduled.

This happens because the hello-kubernetes pod has a node affinity due to the volume mount. If you deploy it without the volume and shut down the node, the pod is rescheduled to the other node after a few minutes. I tested it.
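If you want to reproduce it, a volume-less stand-in could look roughly like this (a minimal sketch, not the repository's example manifest; the names and image tag are assumptions):

```sh
# Volume-less variant: no volume, so there is no node affinity and the pods
# can be rescheduled to any agent. Names and image tag are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-kubernetes-novol
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-kubernetes-novol
  template:
    metadata:
      labels:
        app: hello-kubernetes-novol
    spec:
      containers:
        - name: hello-kubernetes
          image: paulbouwer/hello-kubernetes:1.10
          ports:
            - containerPort: 8080
EOF
```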

I also noticed that both k3s-control-plane-0 and k3s-control-plane-1 have to be up for kubectl to connect to the cluster. Killing either one of the two prevents access to the cluster. This would need another load balancer, right?

Yes, that's right. We use the fixed IP of k3s-control-plane-0.
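As an aside (a sketch only, with placeholder names): with three control planes you could keep API access when k3s-control-plane-0 is down by pointing the kubeconfig at a surviving control plane, or better, at a load balancer in front of all of them. With only two control planes it wouldn't help anyway, because etcd loses quorum, which is what the discussion below is about. The server certificate also has to include the new address as a SAN.

```sh
# Repoint the current kubeconfig cluster entry at another control plane
# (or at an API load balancer). <cluster-name> and the IP are placeholders.
kubectl config set-cluster <cluster-name> \
  --server=https://<k3s-control-plane-1-ip>:6443
```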

I also noticed that both k3s-control-plane-0 and k3s-control-plane-1 have to be up for kubectl to connect to the cluster. Killing either one of the two prevents access to the cluster. This would need another load balancer, right?

I think this is a quorum problem. news.ycombinator.com/item?id=11991639
Please correct me if I'm wrong.

I was testing k3s locally, and 2 servers + 2 agents fails on my laptop too.
3 servers and 2 agents work fine, though.

You're right. I'm not sure why they recommend two or more (https://rancher.com/docs/k3s/latest/en/installation/ha/). I will set it to 3 by default. I'd open an issue in the k3s repository.

Indeed. The description in https://rancher.com/docs/k3s/latest/en/installation/ha/ relates to an external DB. Etcd requires at least three nodes to reach consensus: https://etcd.io/docs/v3.3/faq/#should-i-add-a-member-before-removing-an-unhealthy-member
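To make the quorum argument concrete (just the standard majority formula, nothing k-andy-specific), this is why three control planes is the sensible default:

```sh
# Majority quorum for an etcd cluster of N members is floor(N/2) + 1.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
# members=2 -> quorum=2 -> tolerates 0 failures: losing either control plane stalls the API
# members=3 -> quorum=2 -> tolerates 1 failure:  hence 3 control planes by default
```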