StarpTech / k-andy

Low cost Kubernetes stack for startups, prototypes, and playgrounds on Hetzner Cloud.

Documentation: Proof of high availability?

pintoflager opened this issue

This must be a stupid question, but I'm going to ask it anyway: how does one test that HA actually works?

I ran the installer with defaults, applied the example hello-kubernetes manifest and ended up with:
k3s-agent-0
k3s-agent-1
k3s-control-plane-0
k3s-control-plane-1
loadbalancer --- balancing traffic between k3s-agent-0 and k3s-agent-1

Everything gets installed perfectly, and hello-kubernetes shows up at http://loadbalancer-ip:8080.

Now, when I try to cause havoc in the cluster by powering off either k3s-agent-0 or k3s-agent-1, shouldn't Kubernetes spin up the hello-kubernetes pods on the remaining agent node, and shouldn't the load balancer re-route traffic to that node so hello-kubernetes keeps showing up as usual?
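In case it helps, this is roughly how one could watch what happens while powering a node off (a sketch only; it assumes the kubeconfig written by the installer and that the example manifest created a Deployment named hello-kubernetes in the default namespace):

```sh
# Terminal 1: the node's status should flip to NotReady roughly 40 seconds
# after the agent is powered off.
kubectl get nodes --watch

# Terminal 2: shows which agent the replicas land on and when they are
# evicted and rescheduled (by default only after ~5 minutes of NotReady).
kubectl get pods -o wide --watch
```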

Reading the Kubernetes docs, I was expecting the hello-kubernetes app to stay alive:

Each Pod is tied to the Node where it is scheduled, and remains there until termination (according to restart policy) or deletion. In case of a Node failure, identical Pods are scheduled on other available Nodes in the cluster.

Instead, when I powered agent nodes off and on, the hello-kubernetes app died along with the agent node where both pod replicas (2) were scheduled.

I also noticed that both k3s-control-plane-0 and k3s-control-plane-1 have to be up for kubectl to connect to the cluster. Killing either one of the two prevents access to the cluster. This would need another load balancer, right?
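A quick way to check which endpoint kubectl talks to (a rough sketch, assuming the default context from the installer's kubeconfig):

```sh
# Print the API server address of the current context. If it is a single
# control-plane IP (e.g. https://<control-plane-ip>:6443) rather than a
# load balancer address, losing that endpoint also takes kubectl access with it.
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
```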

Thank you for your work on this project; it has helped me move forward with learning Kubernetes.

I think this is a quorum problem. https://news.ycombinator.com/item?id=11991639
Please correct me if I'm wrong.

I was testing k3s locally, and 2 servers + 2 agents fails on my laptop too.
3 servers and 2 agents work fine, though.

Hi @pintoflager, that's a very typical and important question. Honestly, I only use k-andy for playgrounds, and I haven't tested it intensively for availability. In theory, you're right: if one worker node goes down, the pod is rescheduled to the other node, and worker nodes should stay functional without the control planes. I don't maintain K3s or the Hetzner Cloud Controller, so I recommend asking the question in those repositories or Slack channels.

Instead, when I powered agent nodes off and on, the hello-kubernetes app died along with the agent node where both pod replicas (2) were scheduled.

This happens because the hello-kubernetes pod has a node affinity due to the volume mount. If you deploy it without the volume and shut down the node, the pod is rescheduled to the other node after a few minutes. I tested it.
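If you want to reproduce it, a volume-less stand-in could look roughly like this (a minimal sketch, not the repository's example manifest; the names and image tag are assumptions):

```sh
# Volume-less variant: no volume, so there is no node affinity and the pods
# can be rescheduled to any agent. Names and image tag are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-kubernetes-novol
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-kubernetes-novol
  template:
    metadata:
      labels:
        app: hello-kubernetes-novol
    spec:
      containers:
        - name: hello-kubernetes
          image: paulbouwer/hello-kubernetes:1.10
          ports:
            - containerPort: 8080
EOF
```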

I also noticed that both k3s-control-plane-0 and k3s-control-plane-1 have to be up for kubectl to connect to the cluster. Killing either one of the two prevents access to the cluster. This would need another load balancer, right?

Yes, that's right. We use the fixed IP of k3s-control-plane-0.
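As an aside (a sketch only, with placeholder names): with three control planes you could keep API access when k3s-control-plane-0 is down by pointing the kubeconfig at a surviving control plane, or better, at a load balancer in front of all of them. With only two control planes it wouldn't help anyway, because etcd loses quorum, which is what the discussion below is about. The server certificate also has to include the new address as a SAN.

```sh
# Repoint the current kubeconfig cluster entry at another control plane
# (or at an API load balancer). <cluster-name> and the IP are placeholders.
kubectl config set-cluster <cluster-name> \
  --server=https://<k3s-control-plane-1-ip>:6443
```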

I also noticed that both k3s-control-plane-0 and k3s-control-plane-1 have to be up for kubectl to connect to the cluster. Killing either one of the two prevents access to the cluster. This would need another load balancer, right?

I think this is a quorum problem. news.ycombinator.com/item?id=11991639
Please correct me if I'm wrong.

I was testing k3s locally, and 2 servers + 2 agents fails on my laptop too.
3 servers and 2 agents work fine, though.

You're right. I'm not sure why they recommend two or more (https://rancher.com/docs/k3s/latest/en/installation/ha/). I will set it to 3 by default. I'd open an issue in the k3s repository.

Indeed. The description in https://rancher.com/docs/k3s/latest/en/installation/ha/ relates to an external DB. Etcd requires at least three nodes to reach consensus: https://etcd.io/docs/v3.3/faq/#should-i-add-a-member-before-removing-an-unhealthy-member
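To make the quorum argument concrete (just the standard majority formula, nothing k-andy-specific), this is why three control planes is the sensible default:

```sh
# Majority quorum for an etcd cluster of N members is floor(N/2) + 1.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
# members=2 -> quorum=2 -> tolerates 0 failures: losing either control plane stalls the API
# members=3 -> quorum=2 -> tolerates 1 failure:  hence 3 control planes by default
```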