k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

mellanox vendor plugin cannot register under certain conditions

qiaoning opened this issue · comments

https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/daemon/daemon.go#L541C2-L548

	// load plugins if it has not loaded
	if len(dn.enabledPlugins) == 0 {
		dn.enabledPlugins, err = enablePlugins(dn.platform, dn.useSystemdService, latestState, dn.hostManager, dn.storeManager)
		if err != nil {
			log.Log.Error(err, "nodeStateSyncHandler(): failed to enable vendor plugins")
			return err
		}
	}

if latestState.status is nil, only k8s_plugin and generic_plugin registed, vendor plugin(such as mellanox_plugin) will never be registered in this case, even if generation of SriovNetworkNodeState is changed, because only check dn.enabledPlugins's count, plugin name is NOT checked.

A solution should be like this code template:

	// load plugins if has not loaded
	if len(dn.enabledPlugins) == 0 {
		dn.enabledPlugins, err = enablePlugins(dn.platform, dn.useSystemdService, latestState, dn.hostManager, dn.storeManager)
		if err != nil {
			log.Log.Error(err, "nodeStateSyncHandler(): failed to enable vendor plugins")
			return err
		}
	} else {
		// load mellanox vendor plugin if has not loaded before
		if _, exist := dn.enabledPlugins[mellanox.PluginName]; !exist {
			vendorPlugins, err := registerVendorPlugins(latestState)
			if err != nil {
                                log.Log.Error(err, "nodeStateSyncHandler(): failed to enable mellanox vendor plugins error")
				return err
			}
			if _, exist := vendorPlugins[mellanox.PluginName]; !exist {
				glog.V(0).Info("nodeStateSyncHandler(): mellanox plugin NOT registered.")
			} else {
				dn.enabledPlugins[mellanox.PluginName] = vendorPlugins[mellanox.PluginName]
				glog.V(0).Info("nodeStateSyncHandler(): mellanox plugin registered.")
			}
		}
	}

I think that is not the case because we always run the pollNic before we go here no?

I think that is not the case because we always run the pollNic before we go here no?

Corrected: the code is from release 1.2.0: https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/v1.2.0/pkg/daemon/daemon.go#L426

It's triggered by SriovNetworkNodeState changed, and handled with nodeStateSyncHandler function.
And We actually encountered this situation in production environment.

will you be able to use a new version of the operator? :)

will you be able to use a new version of the operator? :)

https://github.com/k8snetworkplumbingwg/sriov-network-operator/releases

The latest release version is v1.2.0, so shall we use master code to deploy to production environment ?

you are right we are going to do a new release of the operator soon sorry about that

We plan to use sriov-network-operator 1.3.0 release version in our production environment, and we'll continue watching this bug.
Thanks a lot.

Hi @qiaoning please check the latest release we did as we fix also a critical issue in the sriov-cni