Canal/Calico problems on Fedora CoreOS

Question

Jonas18175 opened this issue 3 years ago · comments

Hello,

I set up a Rancher cluster with Fedora CoreOS as the base OS. We have problems with aborted connections between pods, which result in restarts because of failed health checks. I used Calico as the CNI in IPIP mode. Because I could not find the reason for the lost connections, I switched to Canal, because it uses VXLAN by default. It ran perfectly until today, when the entire CNI had an outage even though all pods showed as healthy. The only way to resolve it was to switch temporarily to another CNI (Calico) and then switch back. Since then, the connections have been very unstable. Since last week I have had an error in the log of the CNI pod "calico-node": `[ERROR][47] felix/route_table.go 920: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4`

The host has never had an interface with that name.
Here are all the main interfaces:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:bd:a2:79 brd ff:ff:ff:ff:ff:ff
    altname enp11s0
    inet 172.25.31.242/23 brd 172.25.31.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet 172.25.30.14/23 scope global secondary ens192
       valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 10.42.2.1/32 scope global tunl0
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:8c:b7:83:1f brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
11: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether de:09:58:f9:1a:77 brd ff:ff:ff:ff:ff:ff
    inet 10.42.2.0/32 brd 10.42.2.0 scope global flannel.1
       valid_lft forever preferred_lft forever

It seems that the configuration on the host is incomplete.
Note: All recommended & required kernel modules are present.
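
For anyone comparing nodes, the interface listing above can be checked mechanically. The following is only a minimal sketch: it assumes the text comes from `ip addr`, and the helper names (`interface_names`, `overlay_report`) are hypothetical. The device names are the ones mentioned in this issue (tunl0 for IPIP, flannel.1 for flannel VXLAN, vxlan.calico from the Felix error).

```python
import re

# Header lines of `ip addr` look like "3: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 ...".
# Indented detail lines (link/..., inet ...) do not start with a digit and are skipped.
HEADER = re.compile(r"^(\d+):\s+([^:@\s]+)(?:@[^:\s]+)?:\s+<")

# Overlay device names taken from this issue (not an exhaustive list).
OVERLAY_DEVICES = ("tunl0", "flannel.1", "vxlan.calico")


def interface_names(ip_addr_output):
    """Return interface names in the order they appear in `ip addr` output."""
    names = []
    for line in ip_addr_output.splitlines():
        match = HEADER.match(line)
        if match:
            names.append(match.group(2))
    return names


def overlay_report(ip_addr_output):
    """Map each overlay device name to True/False depending on presence."""
    present = set(interface_names(ip_addr_output))
    return {dev: dev in present for dev in OVERLAY_DEVICES}


# Header lines copied from the listing above (detail lines shortened).
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
3: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
11: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
"""

if __name__ == "__main__":
    for device, found in overlay_report(SAMPLE).items():
        print(f"{device}: {'present' if found else 'missing'}")
```

On this host the sketch reports tunl0 and flannel.1 as present and vxlan.calico as missing, which matches the Felix error.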

How can the problem be solved?

Expected Behavior

Stable network connections between pods.

Current Behavior

Unstable network connections and failed health checks.

Possible Solution

Steps to Reproduce (for bugs)

Context

See the description at the beginning.

Your Environment

Calico version: v3.17.2
Flannel version: v0.13.0
Orchestrator version: v1.20.4
Operating System and version: Fedora CoreOS 34.20210529.3.0
Link to your project (optional):

Casey Davenport · Answer 1 · Wed Jun 30 2021 00:46:51 GMT+0800 (China Standard Time)

Since last week I have had an error in the log of the CNI pod "calico-node": `[ERROR][47] felix/route_table.go 920: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4`

This sounds like Felix has VXLAN enabled, which would normally tell it to create the vxlan.calico device. It might be failing because flannel is also running and trying to configure VXLAN on the node, but I'm not sure. Switching between CNI plugins on the same cluster is bound to cause issues, though.

Generally, changing CNI plugins requires a fresh node to prevent configuration from the old plugin impacting the new.
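
To make the error concrete, here is a rough sketch that pulls the `ifaceRegex` value out of the log line quoted in this issue and tests it against the interface names the reporter listed. The helper names are hypothetical, and it assumes Python's `re` treats this simple pattern the same way Felix does; it is not how Felix itself performs the lookup.

```python
import re

# Log line copied from the issue.
LOG_LINE = (
    "[ERROR][47] felix/route_table.go 920: Failed to get link attributes "
    'error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4'
)

# Interface names from the reporter's `ip addr` listing.
HOST_INTERFACES = ["lo", "ens192", "tunl0", "docker0", "flannel.1"]


def iface_regex_from_log(line):
    """Extract the quoted ifaceRegex value from a Felix log line, or None."""
    match = re.search(r'ifaceRegex="([^"]*)"', line)
    return match.group(1) if match else None


def matching_interfaces(pattern, names):
    """Return the interface names that match the given regex pattern."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


if __name__ == "__main__":
    pattern = iface_regex_from_log(LOG_LINE)
    print(pattern)                                        # ^vxlan.calico$
    print(matching_interfaces(pattern, HOST_INTERFACES))  # []
```

The empty result shows that none of the listed devices satisfies the pattern, so the route table code has nothing to attach to; the only name that would match here is the vxlan.calico device itself.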

jtschoch · Answer 2 · Fri Jul 16 2021 01:02:31 GMT+0800 (China Standard Time)

I tried it on a fresh node, but the result is the same.
At the moment I get the errors at shorter intervals, about every 5-10 minutes. The pods connect to a database cluster inside Kubernetes that has an HA proxy with 3 replicas, so the database connections drop very rarely, but many packets don't reach their target: all deployments return gateway timeout (504) errors every 5-10 minutes. Flannel is not running. I think RKE is misconfiguring the host. Is there any way to fix that?

EDIT:
Fedora CoreOS has no xt_icmp and nf_conntrack_proto_sctp kernel modules, and I haven't found any way to get these modules on this OS.
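
One way to double-check which modules are loaded is to compare a wanted list against `/proc/modules`, where the first field of each line is the module name. This is only a sketch with hypothetical helper names, and the sample text is made up for illustration. Note the assumption it rests on: a name missing from `/proc/modules` is not conclusive, because the functionality may be compiled into the kernel or provided by another module.

```python
# Module names mentioned in this issue.
WANTED = ["xt_icmp", "nf_conntrack_proto_sctp"]


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(wanted, proc_modules_text):
    """Return the wanted module names that are not currently loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in wanted if name not in loaded]


# Made-up sample in /proc/modules format: name, size, refcount, users, state, address.
SAMPLE = """\
nf_conntrack 176128 5 xt_conntrack,nf_nat, Live 0x0000000000000000
vxlan 110592 0 - Live 0x0000000000000000
ipip 20480 0 - Live 0x0000000000000000
"""

if __name__ == "__main__":
    print(missing_modules(WANTED, SAMPLE))
    # On a real node, read the live list instead of the sample:
    # with open("/proc/modules") as handle:
    #     print(missing_modules(WANTED, handle.read()))
```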

jtschoch · Answer 3 · Tue Jul 27 2021 13:45:08 GMT+0800 (China Standard Time)

Update: I set up fresh nodes with openSUSE, but with the old etcd datastore.
After the RKE setup, the same problems occur with both Calico and Canal. So I removed the CNI through RKE by setting it to none, and installed the latest version of Calico from the Helm chart on the official page. This works, but there are still some connection issues, though only 5-24 times per day. The logs are also spammed with the message: `[ERROR][47] felix/route_table.go 920: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4`

Is there some way to fix that?