kubernetes / kubeadm

Aggregator for issues filed against kubeadm

Workarounds for the time before kubeadm HA becomes available

mbert opened this issue · comments

The planned HA features in kubeadm are not going to make it into v1.9 (see #261). So what can be done to make a cluster setup by kubeadm sufficiently HA?

This is what it looks like now:

  • Worker nodes can be scaled up to achieve acceptable redundancy.
  • Without a working active/active or at least active/passive master setup, master failures are likely to cause significant disruptions.

Hence an active/active or active/passive master setup needs to be created (i.e. mimic what kubeadm would supposedly be doing in the future):

  1. Replace the local etcd pod with an etcd cluster of min. 2 x number-of-masters size. This cluster could run on the OS rather than in K8s.
  2. Set up more master instances. That's the interesting bit. The Kubernetes guide for building HA clusters (https://kubernetes.io/docs/admin/high-availability/) can help in understanding what needs to be done. In the end I'd like to have simple step-by-step instructions that take the particularities of a kubeadm setup into consideration.
  3. Not sure whether this is necessary: Probably set up haproxy/keepalived on the master hosts, move the original master's IP address plus SSL termination to it.
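For illustration, step 3 could end up looking something like the haproxy fragment below: TCP passthrough to the apiservers (so TLS stays terminated at the masters), with keepalived floating a VIP across the haproxy hosts. All IPs and names here are placeholders of mine, not taken from any kubeadm docs:

```
# /etc/haproxy/haproxy.cfg (fragment) - placeholder addresses
frontend kube-apiserver
    bind *:6443
    mode tcp
    default_backend kube-masters

backend kube-masters
    mode tcp
    balance roundrobin
    server master1 10.0.0.11:6443 check
    server master2 10.0.0.12:6443 check
    server master3 10.0.0.13:6443 check
```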

This seems achievable if converting the existing master instance to a cluster of masters (2) can be done (the Kubernetes guide for building HA clusters seems to indicate so). Active/active would be no more expensive than active/passive.

I am currently working on this. If I succeed I shall share what I find out here.

See also https://github.com/cookeem/kubeadm-ha - this seems to cover what I want to achieve here.

@mbert we started implementing the HA features and chopped wood on the underlying dependency stack now in v1.9, but it's a short cycle for a big task, so the work will continue in v1.10 as you pointed out.

For v1.9, we will document what you're describing here in the official docs though: how to achieve HA with external dependencies, like setting up a LB.

Excellent. I am digging through all this right now. I am currently stuck at bootstrapping masters 2 and 3, in particular how to configure kubelet and apiserver (how much can I reuse from master 1?) and etcd (I am thinking of using a bootstrap etcd on a separate machine for discovery). The guide from the docs is a bit terse when it comes to this.


@mbert I have been following your comments here and I just want to let you know I followed the guide in docs and was able to stand up a working HA k8s cluster using kubeadm (v1.8.x).

If you are following this setup and you need to bootstrap master 2 and 3, you can reuse almost everything from the first master. You then need to fix up the following configuration files on master 2 and 3 to reflect the current host: /etc/kubernetes/manifests/kube-apiserver.yaml, /etc/kubernetes/kubelet.conf, /etc/kubernetes/admin.conf, and /etc/kubernetes/controller-manager.conf
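That fix-up lends itself to a small script. A sketch, assuming GNU sed; the helper name, addresses and host names below are placeholders of mine:

```shell
#!/bin/sh
# fixup_master_configs OLD_IP NEW_IP OLD_NAME NEW_NAME FILE...
# Rewrites the first master's IP and hostname in the copied kubeadm config
# files so they refer to the local master.
fixup_master_configs() {
  old_ip=$1 new_ip=$2 old_name=$3 new_name=$4; shift 4
  for f in "$@"; do
    if [ -f "$f" ]; then
      sed -i "s/${old_ip}/${new_ip}/g; s/${old_name}/${new_name}/g" "$f"
    fi
  done
}

# e.g. on master 2 (placeholder addresses/names):
# fixup_master_configs 192.168.1.10 192.168.1.11 master1 master2 \
#   /etc/kubernetes/manifests/kube-apiserver.yaml \
#   /etc/kubernetes/kubelet.conf /etc/kubernetes/admin.conf \
#   /etc/kubernetes/controller-manager.conf
```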

Regarding etcd, if you follow this guide, you should stand up an external 3-node etcd cluster that spans the 3 k8s master nodes.

There is also one 'gotcha' item that has NOT yet been covered in the guide.
You can see this issue for detail: cookeem/kubeadm-ha#6

I also asked a few questions related to kubeadm HA from this post: cookeem/kubeadm-ha#7

I really hope you can give me some thoughts on these.

Thank you in advance for your time.


This is great - definitely needed, as I am sure 99% of kubeadm users have a nagging paranoia in the back of their heads about the HA of their master(s).

@kcao3 thank you. I will look into this all on coming Monday. So I understand that it is OK to use identical certificates on all three masters?

If yes, I assume the next thing I'll try will be bringing up kubelet and apiserver on masters 2 and 3 using the configuration from master 1 (with modified IPs and host names in there, of course) and then bootstrapping the etcd cluster by putting a modified etcd.yaml into /etc/kubernetes/manifests.

Today I ran into problems because the running etcd on master 1 already had cluster information in its data dir which I had to remove first, but I was still running into problems. I guess some good nights of sleep will be helpful.

Once I've got this running I shall document the whole process and publish it.

@srflaxu40 yep, and in particular if you have an application that indirectly requires apiserver at runtime (legacy application and service discovery in my case) you cannot afford to lose the only master at any time.

Convert the single-instance etcd to a cluster

I have been able to replace the single etcd instance with a cluster in a fresh K8s cluster. The steps are roughly these:

  1. Set up a separate etcd server. This etcd instance is only needed for bootstrapping the cluster. Generate a discovery URL for 3 nodes on it (see https://coreos.com/etcd/docs/latest/op-guide/clustering.html#etcd-discovery).
  2. Copy /etc/kubernetes from master 1 to masters 2 and 3. Substitute host name and IP in /etc/kubernetes/*.* and /etc/kubernetes/manifests/*.*
  3. Create replacements for /etc/kubernetes/manifests/etcd.yaml for all three masters: set all announcement URLs to the respective hosts' primary IPs, all listen URLs to 0.0.0.0, and add the discovery URL from step 1. I used the attached Jinja2 template file etcd.yaml.j2.txt together with ansible.
  4. Copy the etcd.yaml replacements to /etc/kubernetes/manifests on all three master nodes.
  5. Now things get time critical. Wait for the local etcd process to terminate, then move /var/lib/etcd/member/wal somewhere else before the new process comes up (otherwise it will ignore the discovery URL).
  6. When the new etcd comes up it will now wait for the remaining two instances to join. Hence, quickly launch kubelet on the other two master nodes.
  7. Follow the etcd container's logs on the first master to see if something went completely wrong. If things are OK, then after some minutes the cluster will be operational again.
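For reference, the command section of such an etcd.yaml replacement (step 3) would carry flags roughly like the following; the IPs and the discovery token are placeholders for what steps 1-3 actually produce:

```
# fragment of /etc/kubernetes/manifests/etcd.yaml - placeholder values
    - etcd
    - --name=master1
    - --data-dir=/var/lib/etcd
    - --listen-client-urls=http://0.0.0.0:2379
    - --listen-peer-urls=http://0.0.0.0:2380
    - --advertise-client-urls=http://10.0.0.11:2379
    - --initial-advertise-peer-urls=http://10.0.0.11:2380
    - --discovery=https://discovery.etcd.io/<token-from-step-1>
```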

Step 5 is somewhat awkward, and I have found that if I miss the right time here or need too much time to get the other two masters to join (step 6) my cluster gets into a state from which it can hardly
recover. When this happened, the simplest solution I found was to shut down kubelet on master 2 and 3, run kubeadm reset on all masters and minions, clear the /var/lib/etcd directories on all masters and set up a new cluster using kubeadm init.

While this works, I'd be interested in possible improvements: Is there any alternative, more elegant and robust approach to this (provided that I still want to follow the approach of running etcd in containers on the masters)?

This comment aims to collect feedback and hints at an early stage. I will post updates on the next steps in a similar way before finally documenting this as a followable guide.

@mbert Why do you not use an independent etcd cluster instead of creating it inside K8s?

@KeithTt Thank you for your feedback. I was thinking about these here:

  1. Not to lose any data.
  2. Stay as close to kubeadm's setup as possible.
  3. Have it supervised by K8s and integrated in whatever monitoring I set up for my system.
  4. Keep the number of services running on the OS low.
  5. It wouldn't make things easier since I'd still have to deal with (4) above.

If an independent etcd cluster's advantages outweigh the above list, I shall be happy to be convinced otherwise.

@mbert Please make sure you sync with @jamiehannaford on this effort, he's also working on this / committed to making these docs a thing in v1.9

@mbert are you available to join our SIG meeting today 9PT or the kubeadm implementation PR tomorrow 9PT? I'd love to discuss this with you in a call 👍

@luxas actually it was @jamiehannaford who asked me to open this issue. Once I have got things running and documented I hope to get lots of feedback from him.
9PT, that's in an hour, right? That would be fine. Just let me know how to connect with you.

Following guides here and there I managed to do it. Here are my final steps.

@mbert

Created - not converted - a 3-master-node cluster using kubeadm, with a 3-node etcd cluster deployed on Kubernetes

Here's what I needed to do:

  1. Create 3 master node cluster using kubeadm on barebone servers
  2. Deploy etcd cluster on 3 master nodes using kubeadm
  3. Use non-default pod-network cidr /27

Problems:

  1. Using a non-default pod-network CIDR is impossible to set up using kubeadm init
  2. No documentation exists on creating a multi-master cluster on barebone servers. Other docs are not as detailed as they could be

The way I did it was using kubeadm alpha phase steps, short list follows:

on all master nodes:

  1. Start docker - not kubelet

on masternode1:

  1. Create CA certs
  2. Create apiserver certs with --apiserver-advertise-address, --service-cidr, --apiserver-cert-extra-sans parameters used. Here, really only --apiserver-cert-extra-sans is mandatory.
  3. Create rest of the certs needed
  4. Create kubeconfig and controlplane configs
  5. Edit the newly created yaml files in the /etc/kubernetes/manifests directory to add any extra options you need.
    For me, here's where I set the non-default pod-network CIDR of /27 in kube-controller-manager.yaml. Also, remove NodeRestriction from --admission-control
  6. Copy the previously prepared yaml file for the etcd cluster into the /etc/kubernetes/manifests directory
  7. Copy /etc/kubernetes directory to rest of the master nodes and edit all the files needed to configure them for masternode2 and masternode3.
  8. Once all files are reconfigured, start kubelet ON ALL 3 MASTER NODES.
  9. Once all nodes are up, taint all master-nodes
  10. Bootstrap all tokens
  11. Create token for joining worker nodes
  12. Edit previously created masterConfig.yaml and update token parameter
  13. Upload masterConfig to kubernetes
  14. Install addons
  15. Generate --discovery-token-ca-cert-hash and add worker nodes
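The list above roughly maps to invocations like the following. Subcommand names and flags differ between kubeadm versions, so treat this as an outline rather than exact commands:

```
# outline only - verify subcommands/flags against your kubeadm version
kubeadm alpha phase certs all --config masterConfig.yaml            # steps 2-3
kubeadm alpha phase kubeconfig all --config masterConfig.yaml       # step 4
kubeadm alpha phase controlplane all --config masterConfig.yaml     # step 4
# ... edit manifests, copy /etc/kubernetes to the other masters,
# ... start kubelet on all three masters (steps 5-8)
kubeadm alpha phase mark-master --node-name masternode1             # step 9
kubeadm alpha phase bootstrap-token all --config masterConfig.yaml  # steps 10-11
kubeadm alpha phase upload-config --config masterConfig.yaml        # step 13
kubeadm alpha phase addon all --config masterConfig.yaml            # step 14
```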

This is a really short list of what I did, and it can be automated and reproduced in 5 minutes. Also, for me the greatest bonus was that I was able to set a non-standard pod-network CIDR, as I had the restriction of not being able to spare a class B IP address range.

If you're interested in more detailed version, please let me know and I'll try and create some docs on how this was done.

@dimitrijezivkovic thank you for your comment. I think it would make sense to put all the relevant information together so that one piece of documentation comes out.

I plan to set up a google docs document and start documenting what I did (which is pretty bare-bones). I would then invite others to join and write extensions, corrections, comments?

I have now "documented" a very simple setup in form of a small ansible project: https://github.com/mbert/kubeadm2ha

It is of course still work in progress, but it already allows setting up a multi-master cluster without any bells and whistles. I have tried to keep it as simple as possible, so that by reading it one should be able to find out pretty easily what needs to be done in which order.

Tomorrow I will start writing this up as a simple cooking recipe in a google docs document and invite others to collaborate.

Just to call it out explicitly, there's a bunch of orthogonal issues mashed together in the above conversation/suggestions. It might be useful to break these out separately, and perhaps prioritise some above others:

  • etcd data durability (multi etcd. Requires 2+ etcd nodes)
  • etcd data availability (multi etcd+redundancy. Requires 3+ etcd nodes)
  • apiserver availability (multi apiserver. Requires a loadbalancer/VIP or (at least) DNS with multiple A records)
  • cm/scheduler availability (multi cm/scheduler. Requires 2+ master nodes, and replicas=2+ on these jobs)
  • reboot-all-the-masters recovery (a challenge for self-hosted - requires some form of persistent pods for control plane)
  • kubeadm upgrade support for multi-apiserver/cm-scheduler (varies depending on self-hosted vs non-self-hosted)

Imo the bare minimum we need is etcd durability (or perhaps availability), and the rest can wait. That removes the "fear" factor, while still requiring some manual intervention to recover from a primary master failure (ie: an active/passive setup of sorts).

I think the details of the rest depend hugely on self-hosted vs "legacy", so I feel like it would simplify greatly if we just decided now to assume self-hosted (or not?) - or we clearly fork the workarounds/docs into those two buckets so we don't confuse readers by chopping and changing.

Aside: One of the challenges here is that just about everything to do with install+upgrade changes if you assume a self-hosted+HA setup (it mostly simplifies everything because you can use rolling upgrades, and in-built k8s machinery). I feel that by continually postponing this setup we've actually made it harder for ourselves to reach that eventual goal, and I worry that we're just going to keep pushing the "real" setup back further while we work on perfecting irrelevant single-master upgrades :( I would rather we addressed the HA setup first, and then worked backwards to try to produce a single-host approximation if required (perhaps by packing duplicate jobs temporarily onto the single host), rather than trying to solve single-host and then somehow think that experience will help us with multi-host.

@mbert I have achieved the HA proposal by generating the certs manually for each node, and without deleting NodeRestriction. I use haproxy+keepalived as the loadbalancer now; maybe lvs+keepalived would be better. I will document the details this weekend and hope to share them with you.


FYI all, @mbert has started working on a great WIP guide for kubeadm HA manually that we'll add to the v1.9 kubeadm docs eventually: https://docs.google.com/document/d/1rEMFuHo3rBJfFapKBInjCqm2d7xGkXzh0FpFO0cRuqg/edit

Please take a look at the doc everyone, and provide your comments. We'll soon-ish convert this into markdown and send as a PR to kubernetes/website.

Thank you @mbert and all the others that are active in thread, this will be a great collaboration!

@mbert / @luxas: that doc doesn't allow comments (for me at least 😢)

Done, I had the wrong setting in the doc.


@mbert I have a question for you. Following your approach, assuming I have a functioning HA k8s cluster: do you know how to add new k8s masters to my existing cluster? The issue I am facing now is that the certs were generated based on the FIXED number of k8s master hosts at the time the cluster was bootstrapped. This now prevents any new master from joining the cluster. In the kubelet's log on the new master, you would see something like this: "... x509: certificate is valid for 192.168.1.x, 192.168.1.y, 192.168.1.z, not 192.168.1.n." (where .x, .y, .z are the IP addresses of the current masters, and .n is the address of the new master). Do you know how to resolve this issue? Must the master nodes use the same certificates in this case?

@kcao3 I am not very familiar with this particular aspect. Maybe @jamiehannaford can tell you more about this?

@kcao3 Each master join will generate TLS assets using the specific IPv4 for that server. The config also accepts additional SANs, which should include the LB IPv4 which sits in front of the masters. I have a HA guide in review, so check that out if you have time.

I have just pushed a new commit to https://github.com/mbert/kubeadm2ha

  • flannel networking is now supported (and default)
  • there's a basic installation for the dashboard (NodePort network, insecure, i.e. no SSL) as a separate playbook
  • code cleanup

@mbert I just read the HA guide from @jamiehannaford: https://github.com/jamiehannaford/kubernetes.github.io/blob/3663090ea9b9a29a00c79dd2916e11737ccf1802/docs/setup/independent/high-availability.md. Is it possible that, on each of the master nodes, we can have kubeadm generate and sign separate certificates using the same ca.crt and ca.key?

So the only things that need to be copied from the primary master to the secondary masters are the ca.crt and ca.key. With this approach, on each master (primary and secondary), we will run 'kubeadm init' using a kubeadm configuration file generated from a template like the following:

apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
kubernetesVersion: v{{ KUBERNETES_VERSION }}
networking:
  podSubnet: {{ POD_NETWORK_CIDR }}
api:
  advertiseAddress: {{ MASTER_VIP }}
apiServerCertSANs:
- {{ MASTER_VIP }}
etcd:
  endpoints:
{% for host in groups['masters'] %}
  - http://{{ hostvars[host]['ansible_default_ipv4']['address'] }}:2379
{% endfor %}

If this approach works, it will allow k8s admins to add any new master to their existing multi-masters cluster down the road.

Any thoughts?

@kcao3 That's what I'm trying to do. I figured out I also need to pre-generate proxy CA cert+keys which are different.
But now when I run kubeadm init on my masters, all components come up properly but the kube-proxy still fails due to authentication issues, even though the front-proxy-client.crt is now signed by the same CA on all nodes.

@discordianfish I also ran into auth issues but when deploying Flannel. Wonder if it's related to what you're seeing.

In the meantime I figured out that the 'proxy CA' (front-proxy-*) isn't related to kube-proxy. Still trying to figure out what is going on; it looks though like there is no system:node-proxier role, but I don't know what is supposed to create it.

Since the front-proxy stuff was a red herring, I'm starting over with a clean slate now. But it would be great if someone could confirm that it should work to create the CA credentials and just run init on all masters, given the right advertiseAddress, SANs and etcd endpoints, of course?
Because I'm most worried that kubeadm still somehow generates local secrets other masters don't know about.

When my masters come up, kube-proxy works at first, but kube-proxy on the last master fails. When I recreated the pods, all fail. So running kubeadm init multiple times from different hosts against the same etcd somehow breaks the authentication.

The service account looks correct and has a secret:

$ kubectl -n kube-system get ds kube-proxy -o yaml|grep serviceAccount
      serviceAccount: kube-proxy
      serviceAccountName: kube-proxy

$ kubectl -n kube-system get sa kube-proxy -o yaml|grep -A1 secrets
secrets:
- name: kube-proxy-token-5ll9k

$ kubectl -n kube-system get secret kube-proxy-token-5ll9k
NAME                     TYPE                                  DATA      AGE
kube-proxy-token-5ll9k   kubernetes.io/service-account-token   3         16m

This service account is bound to a role too:

$ kubectl get clusterrolebindings kubeadm:node-proxier -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: 2017-12-07T12:52:54Z
  name: kubeadm:node-proxier
  resourceVersion: "181"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/kubeadm%3Anode-proxier
  uid: 8a9638df-db4d-11e7-8d7e-0e580b140468
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-proxier
subjects:
- kind: ServiceAccount
  name: kube-proxy
  namespace: kube-system

And the role exist and is looking good:

$ kubectl get clusterrole system:node-proxier -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: 2017-12-07T12:52:51Z
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node-proxier
  resourceVersion: "63"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/system%3Anode-proxier
  uid: 88dfc662-db4d-11e7-8d7e-0e580b140468
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update

So I am not sure what is going on. From how I understand everything, this should work, but the apiserver keeps logging: E1207 13:18:20.697707 1 authentication.go:64] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, crypto/rsa: verification error]]

Okay, so it looks like the token is only accepted by one instance of my apiservers, probably on the master where kubeadm init last ran. I thought the service account tokens get stored in etcd?

Mystery solved, thanks to gintas and foxie in #kubernetes-users: we also need to pre-generate the sa keys and distribute them along with the CA.
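In other words, before running init on the secondary masters, the shared credentials have to be identical everywhere. A sketch of the copy step; the helper name and directory layout are mine, and front-proxy-ca.* may arguably belong on the list as well:

```shell
#!/bin/sh
# copy_shared_pki SRC_DIR DEST_DIR...
# Copies the credentials that must be identical on every master: the cluster
# CA and the service-account signing key pair. In practice each DEST_DIR
# would be an scp/rsync target on master2 and master3.
copy_shared_pki() {
  src=$1; shift
  for dest in "$@"; do
    mkdir -p "$dest"
    for f in ca.crt ca.key sa.key sa.pub; do
      if [ -f "$src/$f" ]; then cp "$src/$f" "$dest/"; fi
    done
  done
}
```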

I followed @jamiehannaford's HA guide fairly closely and eventually reached a working HA cluster (set up in a Vagrant setting with a HAProxy load-balancer fronting three master nodes), but I hit a few obstacles along the way and thought I'd share them here since they are probably relevant irrespective of approach:

  • It is important that the etcd version is compatible with the Kubernetes version you're running. From what I can gather the guide targets k8s 1.9 and therefore uses etcd v3.1.10. For a k8s 1.8 installation (which I was targeting), you should use v3.0.17 (using v3.1.17 caused kubeadm to choke, failing to extract the etcd version).

  • I had to run etcd using systemd, since running it as a static pod under /etc/kubernetes/manifests would cause kubeadm preflight checks to fail (it expects that directory to be empty).

  • Before running kubeadm init on master1 and master2, you need to wait for master0 to generate certificates and, in addition to /etc/kubernetes/pki/ca.{crt,key}, copy the /etc/kubernetes/pki/sa.key and /etc/kubernetes/pki/sa.pub files to master1 and master2 (as hinted by @discordianfish). Otherwise, master1 and master2 will generate service account token signing certificates of their own, which in my case caused kube-proxy on those hosts to fail to authenticate against the apiserver.

    There are also the files front-proxy-ca.{crt,key} and front-proxy-client.{crt,key} which I did not copy. I'm unsure if they should have been copied from master0 as well, but things appear to be working anyway.

  • The "regular" kubeadm installation guide encourages you to configure Docker to use the systemd cgroup driver. For me, that also required me to pass --cgroup-driver=systemd to the kubelet via KUBELET_EXTRA_ARGS.
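For reference, such a kubelet drop-in could look like the following; the file name is an arbitrary convention of mine:

```
# /etc/systemd/system/kubelet.service.d/20-cgroup-driver.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--cgroup-driver=systemd"
```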

@petergardfjall Ha, it's funny to see how you ran into exactly the same issues. So yeah, as of yesterday my multi-HA cluster also works. I ran into #590 though, did you find a nice solution for that?
I didn't have to use a special etcd version. I think I'm just using the defaults in coreos' stable etcd-wrapper.
Regarding the front-proxy stuff.. I frankly have no idea what it is.

@discordianfish: I did not run into #590 . I used a kubeadm config file with

api:
  advertiseAddress: <apiserver-loadbalancer-ip>

and it appears to have been picked up by the kube-proxy config map.

> kubectl get cm -n kube-system kube-proxy -o yaml
apiVersion: v1
data:
  kubeconfig.conf: |
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server: https://<apiserver-loadbalancer-ip>:6443
      name: default

Ah okay. Right, it works with a load balancer IP, but you don't get a stable IP when running on AWS and using an ELB, so you need to use a name.

@discordianfish I see, that may actually become a problem since I'm planning on running it in AWS later on. How did you work around that?

@jamiehannaford in the HA guide you make references to using cloud-native loadbalancers. Did you experiment with that? Did you manage to get around #590?

No, haven't found a solution yet. Right now it's just a note in my docs to edit this config map manually.

And I just shot myself in the foot with this: kubeadm init on a new master will overwrite the configmap, and kubernetes/kubernetes#57109 makes it even harder to realize this.

So from what I can tell there is no way to use kubeadm right now in a multi-master setup, without falling back to executing alpha phases manually.

@jamiehannaford's HA guide misses this in general. A cluster created like this will have the IP of a single master hardcoded and breaks once this goes away.

Hello

I just experimented a bit with this and I think I have a working setup now.
So here is what I did:

The experiment was performed on DigtialOcean with 4x 20$ droplets (3 master + 1 worker)

First I created 3 droplet (CoreOS stable):

master1: 188.166.76.108
master2: 188.166.29.53
master3: 188.166.76.133

I then ran the following script on every node to configure the needed pieces to use kubeadm with CoreOS:

#!/bin/bash
set -o nounset -o errexit

RELEASE="$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"
CNI_VERSION="v0.6.0"

mkdir -p /opt/bin
cd /opt/bin
curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/amd64/{kubeadm,kubelet,kubectl}
chmod +x {kubeadm,kubelet,kubectl}

mkdir -p /opt/cni/bin
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-amd64-${CNI_VERSION}.tgz" | tar -C /opt/cni/bin -xz

BRANCH="release-$(cut -f1-2 -d .<<< "${RELEASE##v}")"
cd "/etc/systemd/system/"
curl -L "https://raw.githubusercontent.com/kubernetes/kubernetes/${BRANCH}/build/debs/kubelet.service" | sed 's:/usr/bin:/opt/bin:g' > kubelet.service
mkdir -p "/etc/systemd/system/kubelet.service.d"
cd "/etc/systemd/system/kubelet.service.d"
curl -L "https://raw.githubusercontent.com/kubernetes/kubernetes/${BRANCH}/build/debs/10-kubeadm.conf" | sed 's:/usr/bin:/opt/bin:g' > 10-kubeadm.conf

Create the initial master:

core@master-01 ~ $ sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-cert-extra-sans="127.0.0.1,188.166.76.108,188.166.29.53,188.166.76.133"
[...]
  kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
[...]
core@master-01 ~ $ sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
core@master-01 ~ $ sudo systemctl enable kubelet docker

Next we need to create an etcd cluster, so change the etcd manifest so that etcd listens for peers on all interfaces (WARNING: this isn't safe; in production you should at least use TLS for peer authentication/communication):

core@master-01 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# add --listen-peer-urls=http://0.0.0.0:2380 as a command arg
core@master-01 ~ $ sudo systemctl restart kubelet # for some reason, kubelet does not pick up the change

Change the default etcd member peer-url to the public ipv4 ip:

core@master-01 ~ $ ETCDCTL_API=3 etcdctl member list
8e9e05c52164694d, started, default, http://localhost:2380, http://127.0.0.1:2379

core@master-01 ~ $ ETCDCTL_API=3 etcdctl member update 8e9e05c52164694d --peer-urls="http://188.166.76.108:2380"

Now copy all the kubernetes files (manifests/pki) to the other master nodes:

$ eval $(ssh-agent)
$ ssh-add <path to ssh key>
$ ssh -A core@188.166.29.53 # master-02
core@master-02 ~ $ sudo -E rsync -aP --rsync-path="sudo rsync" core@188.166.76.108:/etc/kubernetes/ /etc/kubernetes
$ ssh -A core@188.166.76.133 # master-03
core@master-03 ~ $ sudo -E rsync -aP --rsync-path="sudo rsync" core@188.166.76.108:/etc/kubernetes/ /etc/kubernetes

Add master-02 to the etcd cluster:

core@master-01 ~ $ ETCDCTL_API=3 etcdctl member add member-02 --peer-urls="http://188.166.29.53:2380"
Member  b52af82cbbc8f30 added to cluster cdf818194e3a8c32

ETCD_NAME="member-02"
ETCD_INITIAL_CLUSTER="member-02=http://188.166.29.53:2380,default=http://188.166.76.108:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

$ ssh core@188.166.29.53 # master-02
core@master-02 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# Add the following as args:
--name=member-02
--initial-cluster=member-02=http://188.166.29.53:2380,default=http://188.166.76.108:2380
--initial-cluster-state=existing
core@master-02 ~ $ sudo systemctl restart kubelet

Add master-03 to the etcd cluster:

core@master-01 ~ $ ETCDCTL_API=3 etcdctl member add master-03 --peer-urls="http://188.166.76.133:2380"
Member 874cba873a1f1e81 added to cluster cdf818194e3a8c32

ETCD_NAME="master-03"
ETCD_INITIAL_CLUSTER="member-02=http://188.166.29.53:2380,master-03=http://188.166.76.133:2380,default=http://188.166.76.108:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

$ ssh core@188.166.76.133 # master-03
core@master-03 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# Add the following as args:
--name=master-03
--initial-cluster=member-02=http://188.166.29.53:2380,master-03=http://188.166.76.133:2380,default=http://188.166.76.108:2380
--initial-cluster-state=existing
core@master-03 ~ $ sudo systemctl start kubelet

So now we should have a 3-node etcd cluster.

Now let master-02 and master-03 join the k8s cluster:

$ ssh core@188.166.29.53 # master-02
core@master-02 ~ $ sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf
core@master-02 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
$ ssh core@188.166.76.133 # master-03
core@master-03 ~ $ sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf
core@master-03 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a

Mark them as masters:

core@master-01 ~ $ sudo kubeadm alpha phase mark-master --node-name master-02
core@master-01 ~ $ sudo kubeadm alpha phase mark-master --node-name master-03

Change kubelet, kube-scheduler and kube-controller-manager to use the local apiserver instead of the master-01 apiserver:

core@master-01 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
core@master-02 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
core@master-03 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}

Change kube-apiserver yaml file to advertise the correct ip and health checking ip:

core@master-02 ~ $ sudo sed 's/188.166.76.108/188.166.29.53/g' -i /etc/kubernetes/manifests/kube-apiserver.yaml
core@master-03 ~ $ sudo sed 's/188.166.76.108/188.166.76.133/g' -i /etc/kubernetes/manifests/kube-apiserver.yaml

Enable kubelet, docker and reboot:

core@master-01 ~ $ sudo systemctl enable kubelet docker; sudo reboot
core@master-02 ~ $ sudo systemctl enable kubelet docker; sudo reboot
core@master-03 ~ $ sudo systemctl enable kubelet docker; sudo reboot

Change kube-proxy to use the apiserver on localhost:

core@master-01 ~ $ sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system edit configmap kube-proxy
# Change server: https://<ip>:6443 to https://127.0.0.1:6443

Now lets try adding a worker node (run the script at the top):
worker-01: 178.62.216.244

$ ssh core@178.62.216.244
core@worker-01 ~ $ sudo iptables -t nat -I OUTPUT -p tcp -o lo --dport 6443 -j DNAT --to 188.166.76.108
core@worker-01 ~ $ sudo iptables -t nat -I POSTROUTING -o eth0 -j SNAT --to-source $(curl -s ipinfo.io | jq -r .ip)
core@worker-01 ~ $ sudo sysctl net.ipv4.conf.eth0.route_localnet=1
core@worker-01 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 127.0.0.1:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
core@worker-01 ~ $ sudo systemctl enable kubelet docker

Now we just need to add a local load balancer to the worker node, and everything is done.
Save the following as /etc/nginx/nginx.conf on the worker-01 node:

error_log stderr notice;

worker_processes auto;
events {
	use epoll;
	worker_connections 1024;
}

stream {
	upstream kube_apiserver {
		least_conn;
		server 188.166.76.108:6443 max_fails=3 fail_timeout=30s;
		server 188.166.29.53:6443 max_fails=3 fail_timeout=30s;
		server 188.166.76.133:6443 max_fails=3 fail_timeout=30s;
	}

	server {
		listen 127.0.0.1:6443 reuseport;
		proxy_pass kube_apiserver;
		proxy_timeout 10m;
		proxy_connect_timeout 1s;

	}
}

Create /etc/kubernetes/manifests:

core@worker-01 ~ $ sudo mkdir /etc/kubernetes/manifests

Add a static nginx-proxy manifest as /etc/kubernetes/manifests/nginx-proxy.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-proxy
  namespace: kube-system
  labels:
    k8s-app: kube-nginx
spec:
  hostNetwork: true
  containers:
  - name: nginx-proxy
    image: nginx:1.13-alpine
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 200m
        memory: 128M
      requests:
        cpu: 50m
        memory: 32M
    volumeMounts:
    - mountPath: /etc/nginx
      name: etc-nginx
      readOnly: true
  volumes:
  - name: etc-nginx
    hostPath:
      path: /etc/nginx

Reboot the node; the temporary iptables rules will then be gone, and everything should work as expected.


A long post, but it shows that it is doable :)

Edit: Forgot to change the API server for the worker node: sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{bootstrap-kubelet.conf,kubelet.conf}
Edit2: The apiserver address embedded in the cluster-info ConfigMap should also be changed (see kubectl --kubeconfig=admin.conf -n kube-public get configmap cluster-info)
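For concreteness, a hedged sketch of what that cluster-info substitution does. The addresses are the ones from this walkthrough (188.166.76.108 is master-01's old advertise address); the block demos the rewrite on a stub file, while on a live cluster you would run the sed against the dumped ConfigMap and re-apply it:

```shell
# Demo on a stub; on a real cluster first dump the ConfigMap:
#   kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-public \
#       get configmap cluster-info -o yaml > cluster-info.yaml
printf 'server: https://188.166.76.108:6443\n' > cluster-info.yaml
# Rewrite the apiserver endpoint inside the embedded kubeconfig so
# new joiners dial the local load balancer instead of master-01:
sed -i 's#https://188\.166\.76\.108:6443#https://127.0.0.1:6443#g' cluster-info.yaml
cat cluster-info.yaml   # → server: https://127.0.0.1:6443
# Then re-apply:
#   kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f cluster-info.yaml
```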

@klausenbusk Great 🎉! If you want to carry/improve kubernetes/website#6458, feel free to send a PR with more details on what you did to help @jamiehannaford which is on vacation at the moment.

@klausenbusk , on the master-02 and master-03 , I don't understand how you were able to join? Since the /etc/kubernetes directory is not empty. Can you please clarify if there is a step missing?
Thanks.

@klausenbusk , on the master-02 and master-03 , I don't understand how you were able to join? Since the /etc/kubernetes directory is not empty. Can you please clarify if there is a step missing?

I did run sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf as documented; removing the whole directory wasn't needed.

To @discordianfish and others wanting to run a HA setup on AWS.

I did manage to get an HA setup to work with Amazon's ELB (despite it not having a single static IP address).

To get it to work, the following steps (in addition to @jamiehannaford's HA guide) need to be taken:

  • Since the ELB does not have a static IP address, we cannot use that as the apiserver advertise address. Instead, we let each master advertise its own private IP address.

    The downside of this approach seems to be that the apiservers will "fight" over the endpoint record, rewriting it every now and then (as can be seen via kubectl get endpoints), which, in turn, has consequences for kube-proxy, which rewrites its iptables rules whenever a change is detected.

    This doesn't appear to harm the correctness of Kubernetes, but I guess it can lead to some performance degradation in large clusters. Any thoughts?

    The issue is discussed in greater detail here.

  • All worker kubelets and kube-proxies need to access the API servers via the load balancer's FQDN. Since kubeadm doesn't allow us to specify different servers for kube-proxy and worker kubelets (they will simply use the IP address of the apiserver that they happened to connect to at kubeadm join), we need to take care of this ourselves.

    • The kube-proxy configuration is stored as a configmap, which gets overwritten every time kubeadm init is run (once for every master node). Therefore, for each kubeadm init we need to patch the configmap as follows:

      kubectl get configmap -n kube-system kube-proxy -o yaml > kube-proxy.cm
      sudo sed -i 's#server:.*#server: https://<masterLoadBalancerFQDN>:6443#g' kube-proxy.cm
      kubectl apply -f kube-proxy.cm --force
      # restart all kube-proxy pods to ensure that they load the new configmap
      kubectl delete pod -n kube-system -l k8s-app=kube-proxy
      
    • On each worker we need to patch the kubelet configuration after join, so that the kubelet connects via the load-balancer.

      sudo kubeadm join --config=kubeadm-config.yaml
      # /etc/kubernetes/kubelet.conf may not be immediately present
      wait_for 60 [ -f /etc/kubernetes/kubelet.conf ]
      sudo sed -i 's#server:.*#server: https://<masterLoadBalancerFQDN>:6443#g' /etc/kubernetes/kubelet.conf
      sudo systemctl restart kubelet
      

With this approach I seem to have a working cluster where one master at a time can go down without (apiserver) service disruption.
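The wait_for used in the join snippet above isn't a standard command; a minimal sketch of the helper I assume it to be (name and semantics are my guess — retry a command once per second until it succeeds or a timeout in seconds expires):

```shell
# Minimal wait_for sketch: poll a command until success or timeout.
wait_for() {
    timeout=$1; shift
    elapsed=0
    until "$@"; do
        elapsed=$((elapsed + 1))
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
    done
}

# Usage, matching the snippet above:
#   wait_for 60 [ -f /etc/kubernetes/kubelet.conf ]
wait_for 5 true && echo "condition met"   # prints "condition met"
```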

This doesn't appear to harm the correctness of Kubernetes, but I guess it can lead to some performance degradation in large clusters. Any thoughts?

You can switch to the new lease reconciler in 1.9, it should fix the "fighting" over the endpoint issue.
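For reference, in v1.9 the lease reconciler is selected via an alpha kube-apiserver flag (--endpoint-reconciler-type=lease). A hedged sketch of adding it to the static manifest on each master — the manifest layout here is a stub, so check your real /etc/kubernetes/manifests/kube-apiserver.yaml before editing:

```shell
# Demo on a stub manifest; on a real master edit
# /etc/kubernetes/manifests/kube-apiserver.yaml in place (the kubelet
# restarts the static pod automatically when the file changes).
printf -- '    - kube-apiserver\n    - --allow-privileged=true\n' > kube-apiserver.yaml
# Switch endpoint reconciliation from the default "master-count" mode
# to the lease-based reconciler (alpha in v1.9):
sed -i '/- kube-apiserver/a\    - --endpoint-reconciler-type=lease' kube-apiserver.yaml
grep 'endpoint-reconciler' kube-apiserver.yaml
```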

Excellent advice @klausenbusk. It worked like a charm.

@petergardfjall

they will simply use the IP address of the apiserver that they happended to connect to at kubeadm join

What happens if you do kubeadm join with the LB's IP?

In terms of the kubelet, I think that's a necessary manual edit. Need to add to HA guide.

@jamiehannaford The problem when using Amazon's ELB is that it doesn't provide a single, stable IP address, so there is no such LB IP that I can make use of (see https://stackoverflow.com/a/35317682/7131191).

So for now the workers join via the ELB's FQDN, which forwards them to one of the apiservers. Since that apiserver advertises its own IP address, the worker configures its kubelet to use that IP address (and not the ELB FQDN). Therefore, to make sure that the kubelet goes through the apiserver load balancer, kubelet.conf needs to be patched afterwards with the ELB FQDN and the kubelet restarted.

I've just open-sourced our stab at HA kubeadm. It comes with a few caveats and ugly workarounds (the kube-proxy hack is especially ugly), but it works: https://github.com/itskoko/kubecfn

I have done some work on the HA setup guide on google docs:

Those changes have been implemented in my ansible-based automation of the described process, plus some more:

  • automatic setup of etcd-operator for applications running in the cluster (not the cluster itself)
  • prefetching of images needed for Kubernetes operation and copying them to the cluster hosts
  • Dashboard setup (insecure, without SSL) with port 30990 on NodePort (if no LB is configured)

I've published the kubeadm-based HA kubernetes installer script I've been working on lately. It will hopefully put my prior comments into context and serve as one concrete example of how to automate the steps of @jamiehannaford's HA guide, which it follows fairly closely.

It's a python script that executes in two phases: render which creates "cluster assets" in the form of SSH keys, certs, and bootscripts, and an install phase which executes those bootscripts over SSH.

The scripts have been tried out on a local Vagrant cluster and against AWS. Two "infrastructure provider scripts" are included in the repo (vagrant and AWS via Terraform) to provision the necessary cluster load-balancer and VMs.

Feel free to try it out. https://github.com/elastisys/hakube-installer

I have not yet found a way to upgrade a HA cluster installed using kubeadm and the manual steps described in my HA setup guide on google docs.

What I have tried so far is the following:

  1. Shut down keepalived on the secondary masters, run kubeadm upgrade on the primary master, apply the same changes in /etc/kubernetes/manifests on the secondary masters as there were on the primary master and start keepalived on the secondary masters.
  2. Same like (1), but in addition to keepalived, also shut down (and later start) kubelet and docker on the secondary masters.
  3. Same like (2), but before applying the upgrade on the primary master, cordon (and later uncordon) all secondary masters.

This did not work, and the result was pretty much the same in all cases. What I get in the secondary masters' logs looks like this:

Unable to register node "master-2.mylan.local" with API server: nodes "master-2.mylan.local" is forbidden: node "master-1.mylan.local" cannot modify node "master-2.mylan.local"

Failed to update status for pod "kube-apiserver-master-2.mylan.local_kube-system(6d84ab47-0008-11e8-a558-0050568a9775)": pods "kube-apiserver-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-controller-manager-master-2.mylan.local_kube-system(665da2db-0008-11e8-a558-0050568a9775)": pods "kube-controller-manager-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-scheduler-master-2.mylan.local_kube-system(65c6a0b3-0008-11e8-a558-0050568a9775)": pods "kube-scheduler-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-flannel-ds-ch8gq_kube-system(47cccaea-0008-11e8-b5b5-0050568a9e45)": pods "kube-flannel-ds-ch8gq" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-proxy-htzg7_kube-system(47cc9d00-0008-11e8-b5b5-0050568a9e45)": pods "kube-proxy-htzg7" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Deleting mirror pod "kube-controller-manager-master-2.mylan.local_kube-system(665da2db-0008-11e8-a558-0050568a9775)" because it is outdated

Failed deleting a mirror pod "kube-controller-manager-master-2.mylan.local_kube-system": pods "kube-controller-manager-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only delete pods with spec.nodeName set to itself

Failed creating a mirror pod for "kube-controller-manager-master-2.mylan.local_kube-system(78432ebfe5d8dfbb93f8173decf3447e)": pods "kube-controller-manager-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only create pods with spec.nodeName set to itself

[... and so forth, repeats itself ...]

Has anybody got a hint how to proceed in getting the secondary masters upgraded cleanly?

@mbert This seems like an RBAC issue. Did you ensure the node name matches the hostname-override?

Also, did you reset etcd for each step? That probably explains why you saw the same result.

@jamiehannaford I am not using any hostname override, neither in kubelet nor in the kubeadm init configuration. And, yes, I am resetting etcd, i.e. tear down the cluster, install a new one from the scratch, then try to upgrade it.

I'll give setting a hostname-override for kubelet a shot and see whether this leads to any other result.

It seems like setting hostname-override when setting up the cluster helps, i.e., makes the secondary masters upgradable. Once this has become a standardised procedure I will document it in the HA setup guide in google docs.

Hi @mbert and others - over the past year or so I have set up several k8s clusters (kubeadm and otherwise), driven from Cobbler / Puppet on CoreOS and CentOS. However, none of these has been HA.

My next task is to integrate K8s HA and I want to use kubeadm. I'm unsure whether to go with the @mbert's HA setup guide or @jamiehannaford's HA guide.

Also - this morning I read @timothysc's Proposal for a highly available control plane configuration for ‘kubeadm’ deployments, and I like the "initial etcd seed" approach he outlines. However, I don't see that same approach in either @mbert's or @jamiehannaford's work. @mbert appears to use a single, k8s-hosted etcd, while @jamiehannaford's document describes the classic approach of external etcd (which is exactly what I have used for my other non-HA POC efforts).

What do you all recommend? External etcd, single self-hosted, or locating and using the "seed" etcd (with pivot to k8s-hosted)? If the last - what guide or documentation do you suggest?

TIA!

@andybrucenet External etcd is recommended for HA setups (at least at this moment in time). CoreOS has recently dropped support for any kind of self-hosted etcd; it should only really be used for dev, staging or casual clusters.

@andybrucenet Not quite - I am using an external etcd cluster just like @jamiehannaford proposes in his guide. Actually the approaches described in our respective documents should be fairly similar. It is based on setting up the etcd cluster you feel you need and then have kubeadm use it when bootstrapping the Kubernetes cluster.

I am currently more or less about to finish my guide and the ansible-based implementation by documenting and implementing a working upgrade procedure - that (and some bugfixes) should be done sometime next week.

Not quite sure whether there will be any need to further transfer my guide into yours, @jamiehannaford what do you think?

Actually the hostname-override was unnecessary. When running kubeadm upgrade apply, some default settings overwrite my adaptations, e.g. NodeRestriction gets re-activated (my scaling of the kube-dns instances also gets reset, but that was of course not a show stopper here). Patching the NodeRestriction admission plugin out of /etc/kubernetes/manifests/kube-apiserver.yaml did the trick.
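For concreteness, a hedged sketch of that patch. The admission-control flag layout below is assumed from a stock kubeadm v1.9 manifest (yours may list different plugins), and the block demos the edit on a stub line rather than the real manifest:

```shell
# Demo on a stub; on a real master edit
# /etc/kubernetes/manifests/kube-apiserver.yaml instead.
printf -- '    - --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,ResourceQuota\n' > kube-apiserver.yaml
# Strip NodeRestriction from the comma-separated plugin list,
# whether it appears mid-list or at the end:
sed -i 's/NodeRestriction,//; s/,NodeRestriction//' kube-apiserver.yaml
cat kube-apiserver.yaml
```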

I have now written a chapter on upgrading HA clusters to my HA setup guide.

Also I have added code for automating this process to my ansible project on github. Take a look into the README.md file there for more information.

@mbert for the upgrade process you've outlined, what are the exact reasons for manually copying the configs and manifests from /etc/kubernetes on the primary master to the secondary masters rather than simply running kubeadm upgrade apply <version> on the secondary masters as well?

@mattkelly It seemed rather dangerous to me.
Since the HA cluster's masters use an active/passive setup while kubeadm only knows about one master, running it again on a different master seemed risky to me.
I may be wrong though.

Replying to myself: having looked at Jamie's guide on kubernetes.io, it seems running kubeadm on all masters may work, even when setting up the cluster. I'll try this out next week and probably make some changes to my documents accordingly.

FWIW, running kubeadm on the secondary masters seems to have worked just fine for me (including upgrade) - but I need to better understand the exact risks at each stage. I've been following @jamiehannaford's guide which is automated by @petergardfjall's hakube-installer (no upgrade support yet though, so I tested that manually).

Edit: Also important to note is that I'm only testing on v1.9+. Upgrade was from v1.9.0 to v1.9.2.

I have now followed the guide on kubernetes.io that @jamiehannaford created, i.e. ran kubeadm init on all master machines (after having copied /etc/kubernetes/pki/ca.* to the secondary masters). This works just fine for setting up the cluster. In order to be able to upgrade to v1.9.2 I am setting up v1.8.3 here.

Now I am running into trouble when trying to upgrade the cluster: Running kubeadm upgrade apply v1.9.2 on the first master fails:

[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests872757515/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests872757515/kube-scheduler.yaml"
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests647361774/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

This step fails reproducibly (I always start from scratch, i.e. remove all configuration files plus etcd data from all nodes before starting a new setup).

I tried out several variations, but no success:

  • Have kubelet use the local API Server instance or the one pointed to by the virtual IP
  • have kube-proxy use the local API Server instance or the one pointed to by the virtual IP

I have attached some logs. However I cannot really find any common pattern that would explain this problem to me. Maybe it is something I just don't know?

upgrade-failed-proxy-on-vip.log
upgrade-failed-proxy-and-kubelet-on-vip.log
upgrade-failed-proxy-and-kubelet-on-local-ip.log

Having tried out a few more things, it boils down to the following:

  • Updating the master which was setup last (i.e. the one on which kubeadm init was run last when setting up the cluster) works.
  • I can get the other nodes working, too, if I edit configmap/kubeadm-config and change the value for MasterConfiguration.nodeName in there to the respective master's host name or simply delete that line.

Others like @mattkelly have been able to perform the upgrade without editing configmap/kubeadm-config, hence the way I set things up must be somehow different.

Anybody got a clue what I should change, so that upgrading works without this (rather dirty) trick?

I have tried upgrading from both 1.8.3 and 1.9.0 to 1.9.2, with the same result.

@mbert I'm now reproducing your issue from a fresh v1.9.0 cluster created using hakube-installer. Trying to upgrade to v1.9.3. I can't think of anything that has changed with my workflow. I'll try to figure it out today.

I verified that deleting the nodeName line from configmap/kubeadm-config for each subsequent master fixes the issue.
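A hedged sketch of that workaround. The ConfigMap name and the nodeName key come from a stock kubeadm v1.9 cluster; the block demos the edit on a stub of the stored MasterConfiguration, with the live-cluster commands noted in comments:

```shell
# Demo on a stub of the MasterConfiguration stored in the
# kube-system/kubeadm-config ConfigMap. On a live cluster:
#   kubectl -n kube-system get configmap kubeadm-config -o yaml > kubeadm-config.yaml
#   (edit as below)
#   kubectl apply -f kubeadm-config.yaml
printf 'apiVersion: kubeadm.k8s.io/v1alpha1\nkind: MasterConfiguration\nnodeName: master-1\n' > kubeadm-config.yaml
# Drop the nodeName pin so 'kubeadm upgrade apply' also works on
# masters other than the one that last wrote the ConfigMap:
sed -i '/^nodeName:/d' kubeadm-config.yaml
cat kubeadm-config.yaml
```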

Thank you, that's very helpful. I have now added patching configmap/kubeadm-config to my instructions.

@mbert oops, I figured out the difference :). For previous upgrades I had been providing the config generated during setup via --config (muscle memory I guess). This is why I never needed the workaround. I believe that your workaround is more correct in case the cluster has changed since init time. It would be great to figure out how to avoid that hack, but it's not too bad in the meantime - especially compared to all of the other workarounds.

Hello,
Will kubeadm 1.10 remove any of the pre-steps/workarounds currently required for HA in 1.9 ?
E.g. the manual creation of a bootstrap etcd, generation of etcd keys, etc?

Closing this item as 1.10 doc is out and we will be moving to further the HA story in 1.11

/cc @fabriziopandini