Proxmox Cloud-init deploy, failing at "Copy vip manifest to first master" task.

Question

untraceablez opened this issue a year ago

The issue occurs when running ansible-playbook site.yml. The playbook runs up until the "Copy vip manifest to first master" task, at which point it fails.

Expected Behavior

The playbook should run all the way through and set up an HA cluster with 3 control nodes and 7 worker nodes.

Current Behavior

The playbook runs fine until the task "Copy vip manifest to first master", then stops and reports a failure for all 3 master nodes. The 7 worker nodes go through the playbook without errors.

Steps to Reproduce

Forked the repo
Changed the inventory to match my local LAN IPs for the VMs
Changed the variables in all.yml to match my local environment (including IPs, since I'm on a 10.0.0.0/24 network)
Adjusted ansible.cfg to point at the inventory location and added my private SSH key
Removed the raspberry-pi role from the playbook

Context (variables)

Operating system:

Hypervisor: Proxmox VE 8
Ansible Controller OS: Ubuntu 22.04

Node OS: Ubuntu 22.04 based off jammy cloud init image.

Hardware:

Intel i9 13900
RAM 128 GB DDR5 5200MHz
MSI Pro Series Z790 Motherboard
4 NVMe RAID-10 Array (4 x 1TB)
2 HDD MIRROR Array (2 x 6TB)

Variables Used

all.yml

k3s_version: "v1.25.12+k3s1"
ansible_user: ansible
systemd_dir: "/etc/systemd/system"

flannel_iface: "eth0"

apiserver_endpoint: "10.0.0.222"

k3s_token: "NA"

extra_server_args:

  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik

extra_agent_args: 

  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

kube_vip_tag_version: "v0.5.12"

metal_lb_speaker_tag_version: "v0.13.9"
metal_lb_controller_tag_version: "v0.13.9"

metal_lb_ip_range: "10.0.0.80-10.0.0.90"
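
A quick way to rule out address problems in these variables is sketched below. It assumes the 10.0.0.0/24 network mentioned in the reproduction steps and only checks that the apiserver_endpoint address and the metal_lb_ip_range lie inside that network and do not overlap. The helper name check_addresses is hypothetical and not part of the playbook, and passing this check does not explain the failing task.

```python
import ipaddress


def check_addresses(network, vip, lb_range):
    """Return a list of problems; an empty list means the values look consistent."""
    net = ipaddress.ip_network(network)
    vip_addr = ipaddress.ip_address(vip)
    start_text, end_text = lb_range.split("-")
    start = ipaddress.ip_address(start_text)
    end = ipaddress.ip_address(end_text)

    problems = []
    if vip_addr not in net:
        problems.append(f"VIP {vip_addr} is outside {net}")
    for addr in (start, end):
        if addr not in net:
            problems.append(f"MetalLB address {addr} is outside {net}")
    if start > end:
        problems.append("MetalLB range start is greater than its end")
    if start <= vip_addr <= end:
        problems.append(f"VIP {vip_addr} falls inside the MetalLB range")
    return problems


# Values copied from all.yml above; the /24 network is an assumption.
problems = check_addresses("10.0.0.0/24", "10.0.0.222", "10.0.0.80-10.0.0.90")
print(problems or "no address conflicts found")
```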

Hosts

host.ini

[k3s_cluster:children]
master
node

[master]
node01 ansible_host=10.0.0.178
node02 ansible_host=10.0.0.225
node03 ansible_host=10.0.0.47

[node]
node04 ansible_host=10.0.0.251
node05 ansible_host=10.0.0.142
node06 ansible_host=10.0.0.237
node07 ansible_host=10.0.0.137
node08 ansible_host=10.0.0.231
node09 ansible_host=10.0.0.118
node10 ansible_host=10.0.0.67
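
As a sanity check on the inventory above, here is a minimal sketch that parses the INI text with a small hand-written parser (not Ansible's own inventory loader) and confirms the group sizes, that no node address is listed twice, and that no node uses the apiserver_endpoint address. The function name parse_inventory is hypothetical.

```python
INVENTORY = """\
[k3s_cluster:children]
master
node

[master]
node01 ansible_host=10.0.0.178
node02 ansible_host=10.0.0.225
node03 ansible_host=10.0.0.47

[node]
node04 ansible_host=10.0.0.251
node05 ansible_host=10.0.0.142
node06 ansible_host=10.0.0.237
node07 ansible_host=10.0.0.137
node08 ansible_host=10.0.0.231
node09 ansible_host=10.0.0.118
node10 ansible_host=10.0.0.67
"""


def parse_inventory(text):
    """Map each [group] to {entry name: {variable: value}}."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups[current] = {}
            continue
        if current is None:
            continue  # ignore entries that appear before any group header
        name, *pairs = line.split()
        groups[current][name] = dict(pair.split("=", 1) for pair in pairs)
    return groups


groups = parse_inventory(INVENTORY)
masters = groups["master"]
workers = groups["node"]
all_ips = [host["ansible_host"] for host in {**masters, **workers}.values()]

print(f"{len(masters)} masters, {len(workers)} workers")
print("duplicate IPs:", len(all_ips) != len(set(all_ips)))
print("VIP reused by a node:", "10.0.0.222" in all_ips)
```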

Possible Solution

I've checked the General Troubleshooting Guide

k3s_ansible_vip_manifest_playbook_error.txt

Taylor Cohron · Answer 1 · Sat Sep 09 2023 06:34:15 GMT+0800 (China Standard Time)

I resolved this by recreating the nodes from cloud-init and changing the naming scheme from node01, node02, etc. to control-node01... and worker-node01.... I'm not sure why that made a difference, but it did. I suspect I also did something else right this time that I only thought I had done correctly before.
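
For anyone trying the same rename, the sketch below generates [master] and [node] inventory entries with the control-nodeNN / worker-nodeNN scheme. It is only an illustration: the answer does not say whether the recreated VMs kept their addresses, so reusing the original IPs here is an assumption, the function name renamed_inventory is hypothetical, and the [k3s_cluster:children] section is left out because it does not change.

```python
def renamed_inventory(master_ips, worker_ips):
    """Build inventory lines using control-nodeNN / worker-nodeNN host names."""
    lines = ["[master]"]
    for index, ip in enumerate(master_ips, start=1):
        lines.append(f"control-node{index:02d} ansible_host={ip}")
    lines.append("")
    lines.append("[node]")
    for index, ip in enumerate(worker_ips, start=1):
        lines.append(f"worker-node{index:02d} ansible_host={ip}")
    return lines


# Addresses reused from the original inventory for illustration only (assumption).
master_ips = ["10.0.0.178", "10.0.0.225", "10.0.0.47"]
worker_ips = [
    "10.0.0.251", "10.0.0.142", "10.0.0.237", "10.0.0.137",
    "10.0.0.231", "10.0.0.118", "10.0.0.67",
]

print("\n".join(renamed_inventory(master_ips, worker_ips)))
```

The VM hostnames set through cloud-init would need to be changed separately; this only covers the inventory side.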