techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.

Home Page: https://technotim.live/posts/k3s-etcd-ansible/

Worker nodes fail to grab CA Certs, playbook stuck at the task "[k3s_agent: Enable and check K3s service]"

untraceablez opened this issue

When running a fresh copy of the repo, with the only modifications being k3s_token and the hosts.ini file, the playbook hangs at the k3s_agent: Enable and check K3s service task and never completes.

Expected Behavior

The playbook should run all the way through and join the worker nodes to the cluster.

Current Behavior

After modifying the hosts.ini file to point to my 3 control VMs on Proxmox (non-LXC) and my 6 Raspberry Pi 4Bs, I generated an alphanumeric token (truly alphanumeric, no symbols of any kind).
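
(A token like this can be generated with, for example, the one-liner below; the exact method doesn't matter as long as the output is purely alphanumeric.)

# example: generate a 36-character alphanumeric string suitable for k3s_token
LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c 36; echo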

I then ran the following command:

ansible-playbook site.yml -i inventory/untraceablez/hosts.ini

The playbook runs through most of the tasks until it arrives at the k3s_agent: Enable and check K3s service task. There it hangs indefinitely until manually killed.

Steps to Reproduce

  1. Clone a fresh copy of the repo
  2. Modify the hosts.ini file to point to 3 control nodes and 6 worker nodes. The control nodes are Ubuntu 22.04 VMs, and the worker nodes are all Raspberry Pi 4Bs running Raspbian (Debian Bullseye).
  3. Run ansible-playbook site.yml -i inventory/mycluster/hosts.ini
  4. The playbook hangs at the k3s_agent: Enable and check K3s service task (see the debugging sketch below).
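
(If it helps narrow this down, the hang can be reproduced against a single worker with extra verbosity; the inventory path and host address below are just examples from this setup.)

# re-run with verbose output, limited to one worker from the [node] group
ansible-playbook site.yml -i inventory/mycluster/hosts.ini -vv --limit 10.0.0.41
# on that worker, follow the k3s agent logs while the Ansible task waits
journalctl -fu k3s-node.service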

Context (variables)

Operating system:

  • 3 Virtual Machines on Proxmox, all running Ubuntu 22.04.
  • 6 Raspberry Pi 4Bs, running Raspbian based on Debian Bullseye (Debian 11)

Hardware:

  • 3 VMs have the following:

    • CPU: 4 Cores
    • RAM: 8GB
    • Storage: 64GB SSD
  • 6 Raspberry Pi 4Bs: five 4GB models and one 8GB model.

    • The 8GB model has a 256GB SD card
    • The five 4GB models each have a 32GB SD card

Variables Used

all.yml

---
k3s_version: v1.25.12+k3s1
# this is the user that has ssh access to these machines
ansible_user: ansible
systemd_dir: /etc/systemd/system
ansible_ssh_private_key_file: ~/.ssh/ansible_control

# Set your timezone
system_timezone: "America/Chicago"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.30.222"

# k3s_token is required so that masters can talk together securely
# this token should be alphanumeric only
k3s_token: "A4xH4LU0BqXCoGsyEEU5HCrZCDVnGg8H37FB"

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'

# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

# these arguments are recommended for servers as well as agents:
extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required ones are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.5.12"

# metallb type frr or native
metal_lb_type: "native"

# metallb mode layer2 or bgp
metal_lb_mode: "layer2"

# bgp options
# metal_lb_bgp_my_asn: "64513"
# metal_lb_bgp_peer_asn: "64512"
# metal_lb_bgp_peer_address: "192.168.30.1"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.13.9"
metal_lb_controller_tag_version: "v0.13.9"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.30.80-192.168.30.90"

# Only enable if your nodes are proxmox LXC nodes, make sure to configure your proxmox nodes
# in your hosts.ini file.
# Please read https://gist.github.com/triangletodd/02f595cd4c0dc9aac5f7763ca2264185 before using this.
# Most notably, your containers must be privileged, and must not have nesting set to true.
# Please note this script disables most of the security of lxc containers, with the trade off being that lxc
# containers are significantly more resource efficient compared to full VMs.
# Mixing and matching VMs and lxc containers is not supported, ymmv if you want to do this.
# I would only really recommend using this if you have particularly low-powered proxmox nodes where the overhead of
# VMs would use a significant portion of your available resources.
proxmox_lxc_configure: false
# the user that you would use to ssh into the host, for example if you run ssh some-user@my-proxmox-host,
# set this value to some-user
proxmox_lxc_ssh_user: root
# the unique proxmox ids for all of the containers in the cluster, both worker and master nodes
proxmox_lxc_ct_ids:
  - 200
  - 201
  - 202
  - 203
  - 204

# Only enable this if you have set up your own container registry to act as a mirror / pull-through cache
# (harbor / nexus / docker's official registry / etc).
# Can be beneficial for larger dev/test environments (for example if you're getting rate limited by docker hub),
# or air-gapped environments where your nodes don't have internet access after the initial setup
# (which is still needed for downloading the k3s binary and such).
# k3s's documentation about private registries here: https://docs.k3s.io/installation/private-registry
custom_registries: false
# The registries can be authenticated or anonymous, depending on your registry server configuration.
# If they allow anonymous access, simply remove the following bit from custom_registries_yaml
#   configs:
#     "registry.domain.com":
#       auth:
#         username: yourusername
#         password: yourpassword
# The following is an example that pulls all images used in this playbook through your private registries.
# It also allows you to pull your own images from your private registry, without having to use imagePullSecrets
# in your deployments.
# If all you need is your own images and you don't care about caching the docker/quay/ghcr.io images,
# you can just remove those from the mirrors: section.
custom_registries_yaml: |
  mirrors:
    docker.io:
      endpoint:
        - "https://registry.domain.com/v2/dockerhub"
    quay.io:
      endpoint:
        - "https://registry.domain.com/v2/quayio"
    ghcr.io:
      endpoint:
        - "https://registry.domain.com/v2/ghcrio"
    registry.domain.com:
      endpoint:
        - "https://registry.domain.com"

  configs:
    "registry.domain.com":
      auth:
        username: yourusername
        password: yourpassword

# Only enable and configure these if you access the internet through a proxy
# proxy_env:
#   HTTP_PROXY: "http://proxy.domain.local:3128"
#   HTTPS_PROXY: "http://proxy.domain.local:3128"
#   NO_PROXY: "*.domain.local,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"

Hosts

hosts.ini

[master]
10.0.0.6[1:3]

[node]
10.0.0.4[1:6]


# only required if proxmox_lxc_configure: true
# must contain all proxmox instances that have a master or worker node
# [proxmox]
# 192.168.30.43

[k3s_cluster:children]
master
node
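
(As a sanity check before running site.yml, SSH reachability of every host in this inventory can be confirmed with an ad-hoc Ansible ping; the inventory path is just an example.)

# verify Ansible can reach all masters and nodes over SSH
ansible -i inventory/mycluster/hosts.ini k3s_cluster -m ping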

Results of journalctl -xeu k3s-node.service

-- Journal begins at Tue 2023-12-05 00:31:03 CST, ends at Thu 2023-12-21 14:06:35 CST. --
Dec 21 13:58:18 k3s-worker-01 systemd[1]: Starting Lightweight Kubernetes...
░░ Subject: A start job for unit k3s-node.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit k3s-node.service has begun execution.
░░ 
░░ The job identifier is 897.
Dec 21 13:58:18 k3s-worker-01 k3s[892]: time="2023-12-21T13:58:18-06:00" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Dec 21 13:58:18 k3s-worker-01 k3s[892]: time="2023-12-21T13:58:18-06:00" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/a717c9efcc7e477937a460d97ec400b38160854e01fb91fdbf1cf8537c2ac392"
Dec 21 13:58:32 k3s-worker-01 k3s[892]: time="2023-12-21T13:58:32-06:00" level=info msg="Starting k3s agent v1.25.12+k3s1 (7515237f)"
Dec 21 13:58:33 k3s-worker-01 k3s[892]: time="2023-12-21T13:58:33-06:00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 192.168.30.222:6443"
Dec 21 13:58:33 k3s-worker-01 k3s[892]: time="2023-12-21T13:58:33-06:00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [192.168.30.222:6443] [default: 192.168.30.222:6443]"
Dec 21 13:58:53 k3s-worker-01 k3s[892]: time="2023-12-21T13:58:53-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Dec 21 13:59:15 k3s-worker-01 k3s[892]: time="2023-12-21T13:59:15-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Dec 21 13:59:37 k3s-worker-01 k3s[892]: time="2023-12-21T13:59:37-06:00" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Possible Solution

I have also run reset.yml, rebooted the Pis and the VMs, and tried running site.yml again, but to no avail; I get the same result each time. I suspect there's a bug somewhere that's causing the worker nodes to point to localhost:6444 instead of the control nodes, at least that's the error indicated by the journalctl output.
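
(Worth noting: per the log above, 127.0.0.1:6444 is the k3s agent's built-in client load balancer, which forwards to the configured server URL, here the kube-vip VIP 192.168.30.222:6443. So the localhost address is expected; the real question is whether that VIP is reachable from the workers at all. A quick check from a worker node, assuming curl is available, might look like this.)

# from a worker: can we reach the kube-vip VIP the agent proxies to?
curl -vk https://192.168.30.222:6443/cacerts
# compare against the agent's local load balancer endpoint seen in the logs
curl -vk https://127.0.0.1:6444/cacerts
# confirm which subnet this worker's flannel interface is actually on
ip -4 addr show eth0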

I notice your inventory IPs are on a different range than metal_lb_ip_range and apiserver_endpoint. Is it possible your issue is related to this discussion?

#158 (comment)

I'll be honest, I thought those were addresses on the cluster's internal network and didn't realize they needed to be modified to match my local LAN. I'll go ahead and update those IPs, re-run site.yml, and report back.

My bad, hit close instead of just comment. 😅

@redmetablue was correct, I needed to update my MetalLB IP range and API server VIP to match my local LAN. I re-ran site.yml, copied the kubeconfig, and confirmed connectivity to the cluster.
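
(For anyone landing here with the same symptom: the fix boils down to picking apiserver_endpoint and metal_lb_ip_range from the same subnet the nodes live on, presumably 10.0.0.0/24 in this case. The addresses below are only illustrative, not the ones actually used.)

# in all.yml, choose unused addresses on the nodes' own subnet, e.g. (illustrative):
#   apiserver_endpoint: "10.0.0.222"
#   metal_lb_ip_range: "10.0.0.80-10.0.0.90"
ansible-playbook reset.yml -i inventory/mycluster/hosts.ini   # optional clean slate
ansible-playbook site.yml -i inventory/mycluster/hosts.ini
# then verify with the copied kubeconfig
kubectl get nodes -o wide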