siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.

Home Page: https://www.talos.dev

DNS I/O timeouts on fresh install of Talos for two nodes out of 8.

no1here0003 opened this issue · comments

Bug Report

Description

I had two nodes that would spam DNS errors and lose WAN connectivity (LAN was OK, and talosctl commands still worked).

Logs

192.168.10.11: user: warning: [2024-06-02T23:06:07.945825935Z]: [talos] error serving dns request {"component": "dns-resolve-cache", "error": "read udp 192.168.10.11:58451->192.168.1.1:53: i/o timeout"}

192.168.10.11: user: warning: [2024-06-02T23:06:10.947386935Z]: [talos] failed refreshing discovery service data {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "error updating local affiliate data: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:111:f100:2000::a83e:3731]:443: connect: network is unreachable\""}

Note that this error occurred for every address being resolved, regardless of which DNS nameserver was used.
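To check whether the timeouts really hit every nameserver, the log lines above can be tallied per upstream address. The following is a minimal sketch, assuming the lines were saved from a command such as `talosctl dmesg` and keep the format shown above (a text prefix followed by a JSON object). The function name `upstream_timeouts` and the regular expression are illustrative, not part of Talos.

```python
import json
import re
from collections import Counter

# Matches the "read udp <src>:<port>-><dst>:<port>: i/o timeout" text seen in the logs above.
TIMEOUT_RE = re.compile(
    r"read udp (?P<src>[\d.]+):\d+->(?P<dst>[\d.]+):\d+: i/o timeout"
)


def upstream_timeouts(lines):
    """Count DNS i/o timeouts per upstream nameserver address (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # The structured part of each message is assumed to be the trailing JSON object.
        start = line.find("{")
        if start == -1:
            continue
        try:
            fields = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        # Only the DNS cache component is of interest here.
        if fields.get("component") != "dns-resolve-cache":
            continue
        match = TIMEOUT_RE.search(fields.get("error", ""))
        if match:
            counts[match.group("dst")] += 1
    return counts
```

If the resulting counter contains only one address, the problem may be limited to that nameserver; if every configured nameserver appears, the path from the node to the router is the more likely suspect.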

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
> talosctl version -n 192.168.10.11
Client:
	Tag:         v1.7.4
	SHA:         cb3a8308
	Built:
	Go version:  go1.22.3
	OS/Arch:     darwin/arm64
Server:
	NODE:        192.168.10.11
	Tag:         v1.7.4
	SHA:         cb3a8308
	Built:
	Go version:  go1.22.3
	OS/Arch:     linux/amd64
	Enabled:     RBAC

Bare-metal NUCs: one is a NUC13ANHi7, the other a NUC11PAHi7. Both have 64 GB of RAM. Both use SOLO10G-SFP-T3. Both were installed from this image:

factory.talos.dev/installer/d0c3e835853d22377d0bb370d2058854b56c942fe0f9ee33d988b737c1842d90:v1.7.4

I run OPNsense as well as UniFi switches.

Please attach at least the output of talosctl support.

The problem is that this is after the fact, although I can provide it for a node that has since logged the error once:

support.zip

Please note that the message appeared only once on this node between yesterday and today, so this archive was not taken while the errors were spamming. I'll be sure to capture one and attach it here if it happens again. I'll be physically moving nodes in a few hours, so there's a chance it will.

A "network is unreachable" error sounds like a misconfiguration of the network; it's not a hardware/connectivity issue.

Sounds like the attached zip file provided nothing then, please feel free to close this and I can open a new one when it happens again.

Other things on the network did not have this issue, including other nodes in the same cluster, so I wasn't sure where to try to help others not have the same issue. If it's not Talos related then I apologize.

OK, so "network is unreachable" probably masks another IPv4 error, but in the logs you submitted this looks like a blip in connectivity to me. It doesn't look related to anything that happened in the system so far.

It might be Talos related, but nothing in the log points to it so far, so it's hard to say one way or the other.

Correct, I should have waited until I could capture it while it was happening. The other bit is that it was constant for me and only started after I ran apply-config with my machine config, but I get it. I can open a new issue if it happens again, or I can close this one after I move the nodes.

I'm comfortable with either. Appreciate it.

Quick update: I apologize for the delay. The new power units have been late, but I will for sure have everything moved tomorrow. Thanks!

The error is happening on one of my control plane nodes at the moment.

Attaching support.zip

support.zip

I don't see anything suspicious in the logs except for the I/O timeouts. Make sure you have network connectivity, that your CNI doesn't do weird things, and that your router and MTU are configured properly. The node gets I/O timeouts talking to both DNS servers, and that starts some time after boot.
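One way to narrow this down is to send the same kind of UDP query to each nameserver from another machine on the affected network segment (Talos itself has no shell to run this in). This is a rough sketch using only the Python standard library; the function names, the default query name, and the 2-second timeout are assumptions chosen for illustration, and 192.168.1.1 is simply the nameserver that appears in the log above.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, echoed back by the server in its reply


def build_query(name, query_id=QUERY_ID):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, name="www.talos.dev", timeout=2.0):
    """Return True if `server` answers a UDP DNS query within `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as errors such as "network is unreachable".
            return False
    # A genuine reply starts with the same ID that was sent.
    return data[:2] == struct.pack("!H", QUERY_ID)


if __name__ == "__main__":
    for server in ["192.168.1.1"]:  # add the other configured nameservers here
        print(server, "ok" if probe(server) else "no answer")
```

If the probe succeeds from a neighbouring machine while the node keeps timing out, the node's own path (interface, VLAN, MTU, CNI) is worth a closer look; if it fails there too, the router or the nameserver is the more likely cause.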

Ok, I thought I'd get you what I could just in case. It sounds like maybe it is a network thing on my side. I appreciate you reviewing it. Thanks!