projectcalico / canal

Policy based networking for cloud native applications

Needs to clear NodeNetworkUnavailable flag on Kubernetes 1.6

maikzumstrull opened this issue

Expected Behavior

Kubernetes 1.6 has this lovely piece of code: https://github.com/kubernetes/kubernetes/blob/release-1.6/pkg/kubelet/kubelet_node_status.go#L214

This marks any new node as restricted to pods with host networking. The assumption is that the cluster networking implementation will clear this bit when the network setup is complete.
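For reference, the effect of that kubelet code is roughly the following (a paraphrase using current client-go types, not the verbatim 1.6 source; the reason and message strings are as I remember them upstream):

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markNetworkUnavailable mirrors what the kubelet does when it registers a
// new node on a route-based cloud provider: it appends a
// NetworkUnavailable=True condition that something else must later flip back.
func markNetworkUnavailable(node *v1.Node) {
	node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
		Type:               v1.NodeNetworkUnavailable,
		Status:             v1.ConditionTrue,
		Reason:             "NoRouteCreated",
		Message:            "Node created without a route",
		LastTransitionTime: metav1.NewTime(time.Now()),
	})
}
```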

The only implementation that actually does that is kubenet, i.e. what Google uses to run GKE.

Discussion: kubernetes/kubernetes#33573

From that discussion:

This will require network plugins to manage the Node NoRouteCreated state on AWS in 1.5, as they already must do on GCE since 1.3.

Thing is, I think nobody actually did that on 1.3 or 1.4. Instead, we passed the --experimental-flannel-overlay flag, which disabled this feature. However, that flag was removed in 1.6.

Current Behavior

Nodes stay NetworkUnavailable forever, even though Canal is perfectly happy.

Possible Solution

After network setup, and possibly after passing some self-check, Canal should use the API to mark the node available.
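A minimal sketch of what that could look like with client-go, assuming the pod gets its node name via the downward API in a NODE_NAME environment variable (the reason and message strings here are illustrative, not what Calico actually ships):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clearNetworkUnavailable patches the node's status subresource to set
// NetworkUnavailable=False. A strategic merge patch on status.conditions
// merges by condition type, so only this one condition is touched.
func clearNetworkUnavailable(client kubernetes.Interface, nodeName string) error {
	now := metav1.NewTime(time.Now())
	raw, err := json.Marshal([]v1.NodeCondition{{
		Type:               v1.NodeNetworkUnavailable,
		Status:             v1.ConditionFalse,
		Reason:             "CanalIsUp",                     // illustrative
		Message:            "Canal is running on this node", // illustrative
		LastTransitionTime: now,
		LastHeartbeatTime:  now,
	}})
	if err != nil {
		return err
	}
	patch := []byte(fmt.Sprintf(`{"status":{"conditions":%s}}`, raw))
	_, err = client.CoreV1().Nodes().PatchStatus(context.TODO(), nodeName, patch)
	return err
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs as a pod in-cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := clearNetworkUnavailable(client, os.Getenv("NODE_NAME")); err != nil {
		panic(err)
	}
}
```

Patching just the one condition, rather than updating the whole node status, avoids clobbering conditions maintained by the kubelet; whatever service account runs this would need patch permission on nodes/status.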

IIUC, this doesn't affect Canal, because it uses CNI and network readiness is determined by the presence of a CNI config file on disk.

The CNI driver in the kubelet will clear this flag for us.

That is incorrect. There are two different readiness bits (sketched below). The kubelet will set NodeReady=true when it finds a valid-looking CNI config. However, on an affected cloud provider (e.g. GCE), the kubelet will not clear the NetworkUnavailable bit for you.

In 1.7, the relevant snippet is https://github.com/kubernetes/kubernetes/blob/release-1.7/pkg/kubelet/kubelet_node_status.go#L231-L240.
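To make the distinction concrete, here is a small helper (a sketch against current client-go types) that reads both bits off a Node; on an affected cluster you would see ready=true with networkUnavailable still true:

```go
package main

import v1 "k8s.io/api/core/v1"

// readinessBits shows that Ready and NetworkUnavailable are independent
// conditions on the same Node: the kubelet flips Ready once it sees a CNI
// config, but nothing flips NetworkUnavailable unless routes are created
// or a network plugin clears it.
func readinessBits(node *v1.Node) (ready, networkUnavailable bool) {
	for _, c := range node.Status.Conditions {
		switch c.Type {
		case v1.NodeReady:
			ready = c.Status == v1.ConditionTrue
		case v1.NodeNetworkUnavailable:
			networkUnavailable = c.Status == v1.ConditionTrue
		}
	}
	return
}
```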

Ah, you're right. I hadn't spotted that - reopening.

Is there any progress on this?

I'm having the same issue with Canal on Google Cloud on Kubernetes 1.9.7.

Canal is fine and I can even ping through it, but the NetworkUnavailable flag is set, so I can't schedule any workloads.

As a workaround I enabled --configure-cloud-routes=true on the kube-controller-manager, and now the flag is cleared and pods are scheduled. That seems like a very dirty solution, though, since there is no need for those routes when using Canal.

I think this is something that needs to be fixed upstream in Kubernetes. The fundamental assumption that if a cloud provider is set then you need to use cloud routes is wrong, and it will affect more than just Canal.

One quick fix for those hitting this might be to use a sidecar container in the canal DaemonSet that sets this flag, e.g. by running kubectl patch inside the container (essentially the same patch as in the client-go sketch above).

I understand.
To be honest, I think the cloud-controller-manager actually addresses that issue, but since it is still alpha I am not currently using it.

I am trying to work around the issue by just letting the kube-controller-manager create the routes it expects; if that does not work, I will look at patching the node status with a sidecar container.
I guess I would feel better if Canal did it, just because I would trust it to do more health checking than whatever I can come up with in a sidecar container.

Maybe we can do what Weave does and set the Kubernetes NodeNetworkUnavailable condition to false when Calico starts:
weaveworks/weave#3307

Thanks @aarnaud!

This will be available in Calico v3.4.0.