k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin


When node draining fails, sriov config daemon should fall back to no draining and reboot the node directly

dastonzerg opened this issue

Even in a cluster with more than one node, a PDB can still prevent the config daemon from draining a node successfully. For example, in a multi-node cluster where only one node carries the special label that the PDB-protected pods select on, those pods have no other node to move to, so the drain fails. That leaves the node in the SchedulingDisabled state, because our current code returns an error from the reconcile (code) and doesn't proceed to reboot the node directly (code).

I think we should log the error but go ahead and reboot the node directly instead of returning, so the cluster doesn't get stuck in a bad state. With this change we would still make a best-effort attempt to evict pods, and any pods that cannot be evicted would simply stay on the node. In addition, we would no longer need to figure out whether to set disableDrain, as suggested in https://docs.okd.io/4.10/networking/hardware_networks/configuring-sriov-operator.html#nw-sriov-configuring-operator_configuring-sriov-operator
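To make the two behaviors concrete, here is a minimal Go sketch that contrasts them. It is not the operator's actual code: the `node` type, `drainFunc`, `applyStrict`, and `applyBestEffort` are hypothetical names invented for illustration, the drain failure is simulated, and the reboot is reduced to setting a flag.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for a cluster node, not the operator's real type.
type node struct {
	name          string
	unschedulable bool // true while the node is cordoned (SchedulingDisabled)
	rebooted      bool
}

// errDrainBlocked simulates an eviction rejected because of a PDB.
var errDrainBlocked = errors.New("eviction blocked by PodDisruptionBudget")

// drainFunc abstracts the drain step so that a failure can be simulated.
type drainFunc func(n *node) error

// applyStrict models the behavior described in the issue: cordon, drain,
// and return the error if the drain fails. The node stays cordoned.
func applyStrict(n *node, drain drainFunc) error {
	n.unschedulable = true
	if err := drain(n); err != nil {
		return fmt.Errorf("drain %s: %w", n.name, err)
	}
	n.rebooted = true
	n.unschedulable = false // simplification: uncordon right after the reboot
	return nil
}

// applyBestEffort models the proposal: log the drain error and reboot anyway.
// Pods that could not be evicted are disrupted by the reboot.
func applyBestEffort(n *node, drain drainFunc) error {
	n.unschedulable = true
	if err := drain(n); err != nil {
		fmt.Printf("drain %s failed, rebooting anyway: %v\n", n.name, err)
	}
	n.rebooted = true
	n.unschedulable = false // simplification: uncordon right after the reboot
	return nil
}

func main() {
	blocked := func(*node) error { return errDrainBlocked }

	a := &node{name: "worker-0"}
	err := applyStrict(a, blocked)
	fmt.Println("strict:", err, "cordoned:", a.unschedulable, "rebooted:", a.rebooted)

	b := &node{name: "worker-1"}
	err = applyBestEffort(b, blocked)
	fmt.Println("best effort:", err, "cordoned:", b.unschedulable, "rebooted:", b.rebooted)
}
```

With a drain that always fails, the strict variant returns an error and leaves the node cordoned, while the best-effort variant reboots regardless. The trade-off is that the best-effort variant overrides whatever the PDB was protecting.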

@SchSeba Hey Sebastian, I would like to hear your opinion on this, thanks!

Hi @dastonzerg, thanks for opening the issue!
I don't think this is something we can do. Let me explain: if there is a PDB for a pod, there is a reason it's in place. For example (and this happened to me), I force-rebooted a node that was running Ceph, telling myself the PDB was not important, and I broke the storage on a cluster. It was a bad afternoon restoring stuff...

So I will not go that way. As for the node stuck in SchedulingDisabled, you can just remove the policy and that will remove the drain, since we no longer have anything to do on the node (if not, that is a bug).

Closing this one; feel free to reopen if needed.