projectcalico / canal

Policy based networking for cloud native applications

Remote node flannel failed to operate

moonek opened this issue

Expected Behavior

Canal pods run successfully on all nodes.

Current Behavior

kubectl get po -n kube-system -o wide
NAME                                    READY     STATUS             RESTARTS   AGE       IP             NODE
canal-7pplx                            2/3       CrashLoopBackOff   12         46m       172.17.8.102   172.17.8.102
canal-dwfp8                            2/3       CrashLoopBackOff   12         46m       172.17.8.103   172.17.8.103
canal-l84s1                            3/3       Running            0          46m       172.17.8.101   172.17.8.101
kube-apiserver-172.17.8.101            1/1       Running            0          1h        172.17.8.101   172.17.8.101
kube-controller-manager-172.17.8.101   1/1       Running            0          1h        172.17.8.101   172.17.8.101
kube-proxy-172.17.8.101                1/1       Running            0          1h        172.17.8.101   172.17.8.101
kube-proxy-172.17.8.102                1/1       Running            0          1h        172.17.8.102   172.17.8.102
kube-proxy-172.17.8.103                1/1       Running            0          1h        172.17.8.103   172.17.8.103
kube-scheduler-172.17.8.101            1/1       Running            0          1h        172.17.8.101   172.17.8.101

Steps to Reproduce (for bugs)

  1. coreos + k8s cluster install
  2. kubelet configuration (--cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --network-plugin=cni); flag placement for steps 2 and 3 is sketched after this list
  3. controller manager configuration (--cluster-cidr=10.244.0.0/16 --allocate-node-cidrs=true)
  4. wget https://raw.githubusercontent.com/projectcalico/canal/master/k8s-install/canal.yaml
  5. kubectl apply -f canal.yaml
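For reference, a minimal sketch of where the flags from steps 2 and 3 end up; the invocations below are illustrative only and omit every other flag, which are not taken from the original report:

# on every node: kubelet started with the CNI flags from step 2
kubelet \
  --network-plugin=cni \
  --cni-conf-dir=/etc/cni/net.d \
  --cni-bin-dir=/opt/cni/bin

# on the master: controller manager started with the pod CIDR flags from step 3
kube-controller-manager \
  --cluster-cidr=10.244.0.0/16 \
  --allocate-node-cidrs=true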

Context

Tested on a local VirtualBox cluster.
The canal pod on the same node as the apiserver runs normally (3/3 Running).
The canal pods on the remote nodes fail (2/3 CrashLoopBackOff).
The flannel container log shows the following:

kubectl logs -f canal-dwfp8 -n kube-system -c kube-flannel
I0712 06:20:19.618913       1 main.go:459] Using interface with name eth1 and address 172.17.8.103
I0712 06:20:19.619093       1 main.go:476] Defaulting external address to interface address (172.17.8.103)
E0712 06:20:49.622290       1 main.go:223] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/canal-dwfp8': Get https://10.3.0.1:443/api/v1/namespaces/kube-system/pods/canal-dwfp8: dial tcp 10.3.0.1:443: i/o timeout

Changing "k8s_api_root" in canal.yaml to "https://172.17.8.101:443" makes no difference.
At this point flannel has not yet created the overlay network, so how is flannel supposed to reach the 10.3.0.1 address at all?

Your Environment

  • Vagrant + VirtualBox (MASTER: 172.17.8.101, WORKERS: 172.17.8.102, 172.17.8.103)
  • Calico version: 1.2.1
  • Flannel version: 0.8.0
  • Orchestrator version: k8s 1.6.4 (no rbac mode)
  • Operating System and version: Container Linux by CoreOS 1437.0.0

I'm guessing that 10.3.0.1 is the cluster IP of the kubernetes service, and that kube-proxy should be setting up iptables rules that DNAT anything sent to 10.3.0.1, rewriting the packet's destination address to 172.17.8.101. You can check that this is true with kubectl get services --all-namespaces.
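A quick way to verify this, assuming kube-proxy runs in iptables mode (the commands below are a sketch, not taken from the original report):

kubectl get services --all-namespaces
# the "kubernetes" service in the default namespace should list 10.3.0.1 as its CLUSTER-IP

# on a worker node, dump the NAT table and look for the rules kube-proxy programs for that VIP
sudo iptables -t nat -S | grep 10.3.0.1
# should show a KUBE-SERVICES rule matching -d 10.3.0.1/32 --dport 443 that jumps into the
# KUBE-SVC-*/KUBE-SEP-* chains where the DNAT to the real apiserver address happens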

You should check from a worker that you are able to curl https://10.3.0.1. In a test cluster I was able to connect with curl, though (unsurprisingly) I received curl: (60) SSL certificate problem: self signed certificate.... I'm imagining that isn't working for you, so you should also try curl https://172.17.8.101:443. If that doesn't work either, try pinging between the hosts to verify that some kind of communication is possible from the worker to the master.
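Putting those checks together, something like the following run from one of the workers; the /version path and timeouts are just illustrative, and any HTTP-level response (even a certificate or authorization error) proves connectivity, while a timeout reproduces the problem:

# run from a worker node, e.g. 172.17.8.102
curl -k --connect-timeout 5 https://10.3.0.1/version           # via the service VIP (relies on kube-proxy DNAT)
curl -k --connect-timeout 5 https://172.17.8.101:443/version   # directly to the master
ping -c 3 172.17.8.101                                         # basic reachability between the hosts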

I brought up a test cluster using the Vagrantfile and master/node-config.yaml from http://docs.projectcalico.org/v2.3/getting-started/kubernetes/installation/vagrant/, only updating --cluster-cidr for the controller manager, and was able to bring up canal with the manifest you linked. There have been updates to that manifest since you opened this issue, though, so you may want to try installing canal again with the latest version; that may resolve the issue.

Blfrg commented

I have been experiencing a similar issue, though my setup differs in that it is k8s 1.7.x with RBAC.

What resolved the issue for me was adding the following flag to the kube-apiserver:
--advertise-address=172.17.8.101

Note: k8s_api_root still resolved to the 10.x.x.x address, but both flannel and calico were then able to communicate with the master node.
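For anyone hitting the same thing, a minimal sketch of how to confirm the effect of that flag; the endpoint check below and the root-cause guess (the apiserver advertising an unreachable interface) are assumptions, not from the original reports:

# check which address the apiserver advertises to the cluster
kubectl get endpoints kubernetes
# on Vagrant/VirtualBox setups this often shows the NAT interface (e.g. 10.0.2.15), which the
# workers cannot reach; after adding --advertise-address=172.17.8.101 to the kube-apiserver
# command line and restarting it, this should show 172.17.8.101, so traffic DNATed from the
# 10.3.0.1 service VIP lands on a reachable address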

Hope that helps!

@Blfrg Thanks for reporting your solution. 👍

@moonek have you been able to try @Blfrg's solution or found another solution for your issue?

Closing this issue as stale.