DNS I/O timeouts on fresh install of Talos for two nodes out of 8.

no1here0003 opened this issue 2 months ago

no1here0003 commented 2 months ago

Bug Report

Description

I had two nodes that would spam DNS errors and lose WAN connectivity (LAN was OK, and talosctl commands still worked).

Logs

192.168.10.11: user: warning: [2024-06-02T23:06:07.945825935Z]: [talos] error serving dns request {"component": "dns-resolve-cache", "error": "read udp 192.168.10.11:58451->192.168.1.1:53: i/o timeout"}

192.168.10.11: user: warning: [2024-06-02T23:06:10.947386935Z]: [talos] failed refreshing discovery service data {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "error updating local affiliate data: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [2a01:111:f100:2000::a83e:3731]:443: connect: network is unreachable\""}

Note that this error appeared for every address, regardless of which DNS nameserver was configured.
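To see which resolvers are timing out and how often, the log lines above can be tallied with a short script. This is only a sketch: the line layout (`<node>: <facility>: <level>: [<timestamp>]: [talos] <message> {<json>}`) is inferred from the two samples in this report, and the function names `parse_line` and `timed_out_resolver` are made up for illustration.

```python
import json
import re
from collections import Counter

# Layout inferred from the sample lines in this report (an assumption):
# "<node>: <facility>: <level>: [<timestamp>]: [talos] <message> {<json fields>}"
LINE_RE = re.compile(
    r"^(?P<node>\S+): (?P<facility>\w+): (?P<level>\w+): "
    r"\[(?P<timestamp>[^\]]+)\]: \[talos\] (?P<message>.*?) (?P<fields>\{.*\})$"
)

# UDP read error as it appears in the log: "read udp <local>-><remote>: i/o timeout"
TIMEOUT_RE = re.compile(r"read udp (?P<local>\S+)->(?P<remote>\S+): i/o timeout")


def parse_line(line):
    """Split one log line into its parts; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["fields"] = json.loads(record["fields"])
    return record


def timed_out_resolver(record):
    """Return the 'host:port' of the resolver that timed out, or None."""
    fields = record["fields"]
    if fields.get("component") != "dns-resolve-cache":
        return None
    match = TIMEOUT_RE.search(fields.get("error", ""))
    return match.group("remote") if match else None


def tally_timeouts(lines):
    """Count DNS i/o timeouts per (node, resolver) pair."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        resolver = timed_out_resolver(record)
        if resolver is not None:
            counts[(record["node"], resolver)] += 1
    return counts
```

Feeding it saved log output, for example `tally_timeouts(open("node.log"))` where `node.log` is a hypothetical file name, shows whether the timeouts hit one nameserver or all of them.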

Environment

Talos version: [talosctl version --nodes <problematic nodes>]

> talosctl version -n 192.168.10.11
Client:
	Tag:         v1.7.4
	SHA:         cb3a8308
	Built:
	Go version:  go1.22.3
	OS/Arch:     darwin/arm64
Server:
	NODE:        192.168.10.11
	Tag:         v1.7.4
	SHA:         cb3a8308
	Built:
	Go version:  go1.22.3
	OS/Arch:     linux/amd64
	Enabled:     RBAC

Kubernetes version: [kubectl version --short]
Server Version: v1.30.1
Platform:
Cilium 1.15.5, Talosconfig is here: https://github.com/no1here0003/Talos-cluster/blob/main/kubernetes/bootstrap/talos/talconfig.yaml

Bare-metal NUCs: one is a NUC13ANHi7, the other a NUC11PAHi7. Both have 64 GB of RAM, both use a SOLO10G-SFP-T3, and both used this image:

factory.talos.dev/installer/d0c3e835853d22377d0bb370d2058854b56c942fe0f9ee33d988b737c1842d90:v1.7.4

I run OPNsense as well as UniFi switches.

Andrey Smirnov · Answer 1 · Tue Jun 04 2024 00:34:28 GMT+0800 (China Standard Time)

Please attach at least the output of talosctl support.

no1here0003 · Answer 2 · Tue Jun 04 2024 00:41:34 GMT+0800 (China Standard Time)

The problem is that this is after the fact, although I can do it for a node where the error has since appeared once.

no1here0003 · Answer 3 · Tue Jun 04 2024 00:42:30 GMT+0800 (China Standard Time)

support.zip

Please note the message came out once on this node between yesterday and today, not taken from the spamming time, though I'll be sure to get it and attach it here if it happens again. I'll be moving nodes physically in a few hours so there's a chance it'll happen again.

Andrey Smirnov · Answer 4 · Tue Jun 04 2024 00:46:48 GMT+0800 (China Standard Time)

A network unreachable error sounds like a misconfiguration of the network; it's not a hardware/connectivity issue.

no1here0003 · Answer 5 · Tue Jun 04 2024 00:50:43 GMT+0800 (China Standard Time)

Sounds like the attached zip file provided nothing then, please feel free to close this and I can open a new one when it happens again.

Other things on the network did not have this issue, including other nodes in the same cluster, so I wasn't sure where to try to help others not have the same issue. If it's not Talos related then I apologize.

Andrey Smirnov · Answer 6 · Tue Jun 04 2024 00:51:05 GMT+0800 (China Standard Time)

OK, so network unreachable probably masks another IPv4 error, but in the logs you submitted this looks like a blip in connectivity to me; it doesn't look related to anything that happened in the system so far.

Andrey Smirnov · Answer 7 · Tue Jun 04 2024 00:52:08 GMT+0800 (China Standard Time)

It might be Talos related, but nothing in the log points to it so far, so it's hard to say one way or the other.

no1here0003 · Answer 8 · Tue Jun 04 2024 00:53:15 GMT+0800 (China Standard Time)

Correct, I should have waited to see if I could capture it when it happens again. The other bit is that it was constant for me and only started after the apply-config of my machine config, but I get it. I can open a new one if it happens again, or I can close this one after I move the nodes.

I'm comfortable with either. Appreciate it.

no1here0003 · Answer 9 · Wed Jun 05 2024 07:37:20 GMT+0800 (China Standard Time)

Quick update: I apologize for the wait, the new power units have been late. I will for sure have everything moved tomorrow. Thanks!

no1here0003 · Answer 10 · Wed Jun 05 2024 22:46:59 GMT+0800 (China Standard Time)

The error is happening on one of my control plane nodes at the moment.

Attaching support.zip

support.zip

Andrey Smirnov · Answer 11 · Wed Jun 05 2024 23:09:25 GMT+0800 (China Standard Time)

I don't see anything suspicious in the logs except for the i/o timeouts. Make sure you have network connectivity, that your CNI doesn't do weird things, that your router/MTU is configured properly, etc. It gets an I/O timeout talking to both DNS servers, and that happens some time after boot.
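One way to act on the connectivity advice is to send a plain UDP DNS query to each nameserver and see whether anything comes back within a timeout. The sketch below is not part of Talos and was not posted in the thread; `build_query` and `probe` are hypothetical helper names, and `example.com` is just a placeholder lookup. Since Talos has no shell, it is assumed to run from another host on the same subnet, or from a pod scheduled on the affected node.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, port=53, name="example.com", timeout=2.0):
    """Return True if the server answers the query, False on timeout or socket error."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and "network is unreachable"
            return False
    # A genuine reply echoes the query ID in its first two bytes.
    return len(reply) >= 2 and reply[:2] == query[:2]
```

Calling `probe("192.168.1.1")` (the resolver address from the log in this report) in a loop while the errors are being logged would show whether the resolver is unreachable from that network segment in general or only from the affected node.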

no1here0003 · Answer 12 · Wed Jun 05 2024 23:14:32 GMT+0800 (China Standard Time)

Ok, I thought I'd get you what I could just in case. It sounds like maybe it is a network thing on my side. I appreciate you reviewing it. Thanks!