cluster start problem
edoziw opened this issue
Hi Tim, thanks for your videos, repos, and docs!!
- tl;dr: updates are in the comments below
- narrowed the symptom down to a kube-vip ARP problem
- cause: apiserver_endpoint and metal_lb_ip_range were not in the IP range of the flannel_iface NIC
ssh <master_ip>
ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.184 netmask 255.255.255.0 broadcast 192.168.0.255
.....
I was using apiserver_endpoint: "192.168.30.222" instead of apiserver_endpoint: "192.168.0.222"
The same problem applied to metal_lb_ip_range; the correct value for me was metal_lb_ip_range: "192.168.0.200-192.168.0.210"
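For anyone hitting the same thing, the mismatch is easy to check with a quick sketch using Python's ipaddress module (the addresses below are the ones from this issue):

```python
import ipaddress

# eth0 address/netmask as reported by ifconfig on the node
iface_net = ipaddress.ip_interface("192.168.0.184/255.255.255.0").network

def in_iface_subnet(ip: str) -> bool:
    """True if the address falls inside the NIC's subnet."""
    return ipaddress.ip_address(ip) in iface_net

print(in_iface_subnet("192.168.30.222"))  # my broken apiserver_endpoint -> False
print(in_iface_subnet("192.168.0.222"))   # corrected apiserver_endpoint -> True

# Both ends of metal_lb_ip_range should pass as well
start, end = "192.168.0.200-192.168.0.210".split("-")
print(in_iface_subnet(start) and in_iface_subnet(end))  # -> True
```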
Description
The problem is possibly similar to the one posted by @AshDevFr in #372.
First I checked #20, but I wasn't able to figure out how to fix my problem.
This is on a set of two Raspberry Pis (hosts.ini, hardware, and OS below).
Expected Behavior
After ansible-playbook site.yml:
- it should not hang
- kubectl get nodes should work
Current Behavior
When running the playbook site.yml, it hangs at the k3s_agent : Enable and check K3s service task.
I tried several times to reset and re-run, but without success.
I've checked all the nodes and I have eth0 everywhere.
My token is also correct.
Steps to Reproduce
- Clone the project https://github.com/techno-tim/k3s-ansible.git
commit: 70ddf7b - Update ansible.cfg, inventory/pi-cluster/hosts.ini, inventory/pi-cluster/group_vars/all.yml
- Run ansible-playbook site.yml
Context (variables)
Operating system:
master: Raspbian GNU/Linux 11 (bullseye) 192.168.0.100 eth0
node: Raspbian GNU/Linux 11 (bullseye) 192.168.0.184 eth0
Hardware:
master: Raspberry Pi 4 Model B Rev 1.5; df.Avail=50G; mem=7.6Gi
node: Raspberry Pi 2 Model B Rev 1.1; df.Avail=4430932 (4.3G); mem=944104 (921Mi) (I know this is small, but the master should still start)
Variables Used
inventory/pi-cluster/group_vars/all.yml
k3s_version: v1.25.12+k3s1
ansible_user: pi # I verified these vars were being used by temporarily changing this to 'piff'; as expected, I could NOT connect to 192.168.0.184
systemd_dir: /etc/systemd/system
flannel_iface: "eth0"
apiserver_endpoint: "10.193.20.10"
k3s_token: "sKcyohCecVULptzpvatzHrYagPGL4mfN"
extra_args: >-
--flannel-iface={{ flannel_iface }}
--node-ip={{ k3s_node_ip }}
extra_server_args: >-
{{ extra_args }}
{{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
--tls-san {{ apiserver_endpoint }}
--disable servicelb
--disable traefik
extra_agent_args: >-
{{ extra_args }}
kube_vip_tag_version: "v0.5.12"
metal_lb_speaker_tag_version: "v0.13.9"
metal_lb_controller_tag_version: "v0.13.9"
metal_lb_ip_range: "10.193.20.20-10.193.20.99"
diff inventory/pi-cluster/group_vars/all.yml inventory/sample/group_vars/all.yml
4c4
< ansible_user: pi
---
> ansible_user: ansibleuser
8c8
< system_timezone: "Etc/UTC"
---
> system_timezone: "Your/Timezone"
18c18
< k3s_token: "7ZATLZ3NUJ9sssecretsssFSH9XXLF54J"
---
> k3s_token: "some-SUPER-DEDEUPER-secret-password"
44c44
< kube_vip_tag_version: "v0.6.2"
---
> kube_vip_tag_version: "v0.5.12"
58,59c58,59
< metal_lb_speaker_tag_version: "v0.13.11"
< metal_lb_controller_tag_version: "v0.13.11"
---
> metal_lb_speaker_tag_version: "v0.13.9"
> metal_lb_controller_tag_version: "v0.13.9"
Note: I also tried:
- the metal_lb version from the original commit (v0.13.9 vs v0.13.11)
- the kube_vip version (v0.5.12 vs v0.6.2)
Hosts
ansible.cfg
[defaults]
inventory = inventory/pi-cluster/hosts.ini
inventory/pi-cluster/hosts.ini
[master]
192.168.0.100 ansible_user=cringe
[node]
192.168.0.184
# only required if proxmox_lxc_configure: true
# must contain all proxmox instances that have a master or worker node
# [proxmox]
# 192.168.30.43
[k3s_cluster:children]
master
node
Logs
On the ops host
$ ansible-playbook reset.yml
PLAY RECAP *************************************************************************************************************
192.168.0.100 : ok=27 changed=2 unreachable=0 failed=0 skipped=12 rescued=0 ignored=0
192.168.0.184 : ok=27 changed=2 unreachable=0 failed=0 skipped=12 rescued=0 ignored=0
$ ansible-playbook site.yml
[WARNING]: Could not match supplied host pattern, ignoring: proxmox
PLAY [Prepare Proxmox cluster] *****************************************************************************************
skipping: no hosts matched
PLAY [Prepare k3s nodes] ***********************************************************************************************
TASK [Gathering Facts] *************************************************************************************************
ok: [192.168.0.100]
ok: [192.168.0.184]
# other ok's ......<snipped for brevity>....
TASK [download : Download k3s binary arm64] ****************************************************************************
skipping: [192.168.0.184]
changed: [192.168.0.100]
TASK [download : Download k3s binary armhf] ****************************************************************************
skipping: [192.168.0.100]
changed: [192.168.0.184]
# ...
TASK [raspberrypi : Flush iptables before changing to iptables-legacy] *************************************************
changed: [192.168.0.100]
changed: [192.168.0.184]
TASK [raspberrypi : Changing to iptables-legacy] ***********************************************************************
ok: [192.168.0.100]
ok: [192.168.0.184]
TASK [raspberrypi : Changing to ip6tables-legacy] **********************************************************************
ok: [192.168.0.100]
ok: [192.168.0.184]
PLAY [Setup k3s servers] ***********************************************************************************************
TASK [Gathering Facts] *************************************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Stop k3s-init] **************************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Clean previous runs of k3s-init] ********************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Deploy vip manifest] ********************************************************************************
included: /home/npi/dev/k3s/k3s-ansible-tim/roles/k3s_server/tasks/vip.yml for 192.168.0.100
TASK [k3s_server : Create manifests directory on first master] *********************************************************
changed: [192.168.0.100]
TASK [k3s_server : Download vip rbac manifest to first master] *********************************************************
changed: [192.168.0.100]
TASK [k3s_server : Copy vip manifest to first master] ******************************************************************
changed: [192.168.0.100]
TASK [k3s_server : Deploy metallb manifest] ****************************************************************************
included: /home/npi/dev/k3s/k3s-ansible-tim/roles/k3s_server/tasks/metallb.yml for 192.168.0.100
TASK [k3s_server : Create manifests directory on first master] *********************************************************
ok: [192.168.0.100]
TASK [k3s_server : Download to first master: manifest for metallb-native] **********************************************
changed: [192.168.0.100]
TASK [k3s_server : Set image versions in manifest for metallb-native] **************************************************
ok: [192.168.0.100] => (item=metallb/speaker:v0.13.11 => metallb/speaker:v0.13.11)
TASK [k3s_server : Init cluster inside the transient k3s-init service] *************************************************
changed: [192.168.0.100]
TASK [k3s_server : Verify that all nodes actually joined (check k3s-init.service if this fails)] ***********************
FAILED - RETRYING: [192.168.0.100]: Verify that all nodes actually joined (check k3s-init.service if this fails) (20 retries left).
FAILED - RETRYING: [192.168.0.100]: Verify that all nodes actually joined (check k3s-init.service if this fails) (19 retries left).
FAILED - RETRYING: [192.168.0.100]: Verify that all nodes actually joined (check k3s-init.service if this fails) (18 retries left).
ok: [192.168.0.100]
TASK [k3s_server : Save logs of k3s-init.service] **********************************************************************
skipping: [192.168.0.100]
TASK [k3s_server : Kill the temporary service used for initialization] *************************************************
changed: [192.168.0.100]
TASK [k3s_server : Copy K3s service file] ******************************************************************************
changed: [192.168.0.100]
TASK [k3s_server : Enable and check K3s service] ***********************************************************************
changed: [192.168.0.100]
TASK [k3s_server : Wait for node-token] ********************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Register node-token file access mode] ***************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Change file access node-token] **********************************************************************
changed: [192.168.0.100]
TASK [k3s_server : Read node-token from master] ************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Store Master node-token] ****************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Restore node-token file access] *********************************************************************
changed: [192.168.0.100]
TASK [k3s_server : Create directory .kube] *****************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Copy config file to user home directory] ************************************************************
changed: [192.168.0.100]
TASK [k3s_server : Configure kubectl cluster to https://192.168.30.222:6443] *******************************************
changed: [192.168.0.100]
TASK [k3s_server : Create kubectl symlink] *****************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Create crictl symlink] ******************************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Get contents of manifests folder] *******************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Get sub dirs of manifests folder] *******************************************************************
ok: [192.168.0.100]
TASK [k3s_server : Remove manifests and folders that are only needed for bootstrapping cluster so k3s doesn't auto apply on start] ***
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/vip.yaml)
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/vip-rbac.yaml)
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/local-storage.yaml)
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/metallb-crds.yaml)
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/ccm.yaml)
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/rolebindings.yaml)
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/coredns.yaml)
changed: [192.168.0.100] => (item=/var/lib/rancher/k3s/server/manifests/metrics-server)
PLAY [Setup k3s agents] ************************************************************************************************
TASK [Gathering Facts] *************************************************************************************************
ok: [192.168.0.184]
TASK [k3s_agent : Copy K3s service file] *******************************************************************************
changed: [192.168.0.184]
TASK [k3s_agent : Enable and check K3s service] **
# hangs here ^^^ #########################
On the master node
sudo journalctl -u k3s > /tmp/k3s.log
grep -E --line-number 'arting Light' /tmp/k3s.log  # find the last start log
tail -n +2277270 /tmp/k3s.log > /tmp/k3s-1642.log
k3s-1642.log
On the worker node
(I know the times don't match; I didn't feel like finding the tail of the master logs again. These are the same symptoms the worker showed on Oct 03.)
Oct 04 17:12:19 nr systemd[1]: Starting Lightweight Kubernetes...
Oct 04 17:12:19 nr k3s[1147]: time="2023-10-04T17:12:19Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Oct 04 17:12:19 nr k3s[1147]: time="2023-10-04T17:12:19Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/e97cb72712a2a0cfd96fd00d8839059b2dbc8c35357aba1cf7248fc5a7e6db66"
Oct 04 17:13:03 nr k3s[1147]: time="2023-10-04T17:13:03Z" level=info msg="Starting k3s agent v1.25.12+k3s1 (7515237f)"
Oct 04 17:13:03 nr k3s[1147]: time="2023-10-04T17:13:03Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 192.168.30.222:6443"
Oct 04 17:13:03 nr k3s[1147]: time="2023-10-04T17:13:03Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [192.168.30.222:6443] [default: 192.168.30.222:6443]"
Oct 04 17:13:23 nr k3s[1147]: time="2023-10-04T17:13:23Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
# last error repeats several times
Update: it seems the master node is running (I can use kubectl in its ssh session and curl the kube-vip IP at the ${apiserver_endpoint}); it's just that the kube-vip ARP broadcast is failing to reach:
- the worker (k0) node
- my dev machine (dev), where I want to run kubectl
I'll continue to investigate.
Let me know if you have any ideas or troubleshooting tips for what could be stopping ARP from reaching the k0 and dev machines.
Thanks
steps
ssh km
container_id="$(crictl ps -a | grep kube-vip | awk '{print $1}')"
crictl logs "${container_id}"
ping -D 192.168.30.222
curl -v -k https://192.168.30.222:6443/cacerts
output
time="2023-10-05T10:06:31Z" level=info msg="Starting kube-vip.io [v0.6.2]"
time="2023-10-05T10:06:31Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[true], Services:[false]"
time="2023-10-05T10:06:31Z" level=info msg="No interface is specified for VIP in config, auto-detecting default Interface"
time="2023-10-05T10:06:31Z" level=info msg="prometheus HTTP server started"
time="2023-10-05T10:06:31Z" level=info msg="kube-vip will bind to interface [eth0]"
time="2023-10-05T10:06:31Z" level=info msg="Starting Kube-vip Manager with the ARP engine"
time="2023-10-05T10:06:31Z" level=info msg="Beginning cluster membership, namespace [kube-system], lock name [plndr-cp-lock], id [km]"
I1005 10:06:31.269258 1 leaderelection.go:245] attempting to acquire leader lease kube-system/plndr-cp-lock...
I1005 10:06:31.342413 1 leaderelection.go:255] successfully acquired lease kube-system/plndr-cp-lock
time="2023-10-05T10:06:31Z" level=info msg="Node [km] is assuming leadership of the cluster"
time="2023-10-05T10:06:31Z" level=info msg="Gratuitous Arp broadcast will repeat every 3 seconds for [192.168.30.222]"
###
PING 192.168.30.222 (192.168.30.222) 56(84) bytes of data.
[1696504936.602201] 64 bytes from 192.168.30.222: icmp_seq=1 ttl=64 time=0.201 ms
### curl output
...
HTTP/2 200
... cert output ... ok
ssh k0
ping -D 192.168.30.222 -w 2
curl -v -k https://192.168.30.222:6443/cacerts --connect-timeout 3
output
PING 192.168.30.222 (192.168.30.222) 56(84) bytes of data.
--- 192.168.30.222 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1029ms
### curl output
* Trying 192.168.30.222:6443...
* Connection timed out after 3001 milliseconds
* Closing connection 0
curl: (28) Connection timed out after 3001 milliseconds
User error:
After some diagnosis, I found that kube-vip's ARP broadcast was not being registered by the worker node.
ssh <worker node ip>
sudo tcpdump -ni eth0 arp
# the broadcast of 192.168.30.222 d8:3a:dd:34:c6:62 was being received
# manually setting the entry failed (the hardware type likely needed the -H flag, i.e. arp -v -H ether ...):
arp -v ether -i eth0 -s 192.168.30.222 d8:3a:dd:34:c6:62
ether: Unknown host
See the tl;dr section in the initial comment.
Sorry for the confusion.
Maybe add a note to README.md or the troubleshooting guide to make sure metal_lb_ip_range and apiserver_endpoint contain IPs that are valid within the flannel_iface subnet, or assert this somewhere in the play?
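If such an assert were added, it might look something like this sketch (untested; it assumes the ansible.utils collection is available for the ipaddr filter, and the task and variable names here are illustrative, not part of the playbook today):

```yaml
# Illustrative pre-flight check; assumes the ansible.utils collection
# is installed (for the ipaddr filter).
- name: Ensure apiserver_endpoint is inside the flannel_iface subnet
  vars:
    node_cidr: >-
      {{ (ansible_facts[flannel_iface].ipv4.address ~ '/' ~
          ansible_facts[flannel_iface].ipv4.netmask)
         | ansible.utils.ipaddr('network/prefix') }}
  ansible.builtin.assert:
    that:
      - apiserver_endpoint | ansible.utils.ipaddr(node_cidr)
    fail_msg: >-
      apiserver_endpoint {{ apiserver_endpoint }} is not in {{ node_cidr }}
      on {{ flannel_iface }}; kube-vip's ARP announcements will not be
      reachable from that address.
```

A similar assert could cover both endpoints of metal_lb_ip_range after splitting it on "-".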