projectcalico / canal

Policy based networking for cloud native applications

Needs to clear NodeNetworkUnavailable flag on Kubernetes 1.6

maikzumstrull opened this issue

Expected Behavior

Kubernetes 1.6 has this lovely piece of code: https://github.com/kubernetes/kubernetes/blob/release-1.6/pkg/kubelet/kubelet_node_status.go#L214

This marks any new node as restricted to pods with host networking. The assumption is that the cluster networking implementation will clear this bit when the network setup is complete.
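For reference, the effect of that kubelet code is roughly the following (a paraphrase using current client-go types, not the verbatim 1.6 source; the reason and message strings are as I remember them upstream):

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markNetworkUnavailable mirrors what the kubelet does when it registers a
// new node on a route-based cloud provider: it appends a
// NetworkUnavailable=True condition that something else must later flip back.
func markNetworkUnavailable(node *v1.Node) {
	node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
		Type:               v1.NodeNetworkUnavailable,
		Status:             v1.ConditionTrue,
		Reason:             "NoRouteCreated",
		Message:            "Node created without a route",
		LastTransitionTime: metav1.NewTime(time.Now()),
	})
}
```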

The only implementation that actually does that is kubenet, i.e. what Google uses to run GKE.

Discussion: kubernetes/kubernetes#33573

From that discussion:

This will require network plugins to manage the Node NoRouteCreated state on AWS in 1.5, as they already must do on GCE since 1.3.

Thing is, I think nobody actually did that on 1.3 or 1.4. Instead, we passed the --experimental-flannel-overlay flag, which disabled this feature. However, that flag was removed in 1.6.

Current Behavior

Nodes stay NetworkUnavailable forever, even though Canal is perfectly happy.

Possible Solution

After network setup, and possibly after passing some self-check, Canal should use the API to mark the node available.
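A minimal sketch of what that could look like with client-go, assuming the pod gets its node name via the downward API in a NODE_NAME environment variable (the reason and message strings here are illustrative, not what Calico actually ships):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clearNetworkUnavailable patches the node's status subresource to set
// NetworkUnavailable=False. A strategic merge patch on status.conditions
// merges by condition type, so only this one condition is touched.
func clearNetworkUnavailable(client kubernetes.Interface, nodeName string) error {
	now := metav1.NewTime(time.Now())
	raw, err := json.Marshal([]v1.NodeCondition{{
		Type:               v1.NodeNetworkUnavailable,
		Status:             v1.ConditionFalse,
		Reason:             "CanalIsUp",                     // illustrative
		Message:            "Canal is running on this node", // illustrative
		LastTransitionTime: now,
		LastHeartbeatTime:  now,
	}})
	if err != nil {
		return err
	}
	patch := []byte(fmt.Sprintf(`{"status":{"conditions":%s}}`, raw))
	_, err = client.CoreV1().Nodes().PatchStatus(context.TODO(), nodeName, patch)
	return err
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes this runs as a pod in-cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := clearNetworkUnavailable(client, os.Getenv("NODE_NAME")); err != nil {
		panic(err)
	}
}
```

Patching just the one condition, rather than updating the whole node status, avoids clobbering conditions maintained by the kubelet; whatever service account runs this would need patch permission on nodes/status.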

IIUC, this doesn't affect Canal, because it uses CNI and network readiness is determined by the presence of a CNI config file on disk.

The CNI driver in the kubelet will clear this flag for us.

That is incorrect. There are two different readiness bits (sketched below). The kubelet will set NodeReady=true when it finds a valid-looking CNI config. However, on an affected cloud provider (e.g. GCE), the kubelet will not clear the NetworkUnavailable bit for you.

In 1.7, the relevant snippet is https://github.com/kubernetes/kubernetes/blob/release-1.7/pkg/kubelet/kubelet_node_status.go#L231-L240.
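To make the distinction concrete, here is a small helper (a sketch against current client-go types) that reads both bits off a Node; on an affected cluster you would see ready=true with networkUnavailable still true:

```go
package main

import v1 "k8s.io/api/core/v1"

// readinessBits shows that Ready and NetworkUnavailable are independent
// conditions on the same Node: the kubelet flips Ready once it sees a CNI
// config, but nothing flips NetworkUnavailable unless routes are created
// or a network plugin clears it.
func readinessBits(node *v1.Node) (ready, networkUnavailable bool) {
	for _, c := range node.Status.Conditions {
		switch c.Type {
		case v1.NodeReady:
			ready = c.Status == v1.ConditionTrue
		case v1.NodeNetworkUnavailable:
			networkUnavailable = c.Status == v1.ConditionTrue
		}
	}
	return
}
```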

Ah, you're right. I hadn't spotted that - reopening.

Is there any progress on this?

I'm having the same issue with Canal on Google Cloud on Kubernetes 1.9.7.

Canal is fine and I can even ping through it, but the NetworkUnavailable flag is set, so I can't schedule any workloads.

As a workaround I enabled --configure-cloud-routes=true on the kube-controller-manager, and now the flag is cleared and pods are scheduled. That seems like a very dirty solution, though, since there is no need for those routes when using Canal.

I think this is something that needs to be fixed upstream in Kubernetes. The fundamental assumption that if a cloud provider is set then you need to use cloud routes is wrong, and it will affect more than just Canal.

One quick fix for those hitting this might be to use a sidecar container in the canal DaemonSet that sets this flag, e.g. by running kubectl patch inside the container (essentially the same patch as in the client-go sketch above).

I understand.
To be honest, I think the cloud-controller-manager actually addresses that issue, but since it is still alpha I am not currently using it.

I am trying to work around the issue by just letting the kube-controller-manager create the routes it expects; if that does not work, I will look at patching the node status with a sidecar container.
I guess I would feel better if Canal did it, just because I would trust it to do more health checking than whatever I can come up with in a sidecar container.

Maybe we can do what Weave does and set the Kubernetes NodeNetworkUnavailable condition to false when Calico starts:
weaveworks/weave#3307

Thanks @aarnaud!

This will be available in Calico v3.4.0.