mellanox vendor plugin cannot register under certain conditions
qiaoning opened this issue · comments
// load plugins if it has not loaded
if len(dn.enabledPlugins) == 0 {
	dn.enabledPlugins, err = enablePlugins(dn.platform, dn.useSystemdService, latestState, dn.hostManager, dn.storeManager)
	if err != nil {
		log.Log.Error(err, "nodeStateSyncHandler(): failed to enable vendor plugins")
		return err
	}
}
If latestState.status is nil, only k8s_plugin and generic_plugin are registered. In that case a vendor plugin (such as mellanox_plugin) will never be registered, even if the generation of SriovNetworkNodeState changes later, because the guard only checks the number of entries in dn.enabledPlugins; the plugin names are NOT checked.
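To make the failure mode concrete, here is a minimal, self-contained sketch of the count-only guard. It is not the operator's code: `Plugin`, `NodeState`, its `Vendors` field, and `syncCountOnly` are hypothetical stand-ins, and this `enablePlugins` is a simplified imitation of the behavior described above, under the assumption that vendor plugins are derived from the node state's status.

```go
package main

import "fmt"

// Plugin is a hypothetical stand-in for the operator's plugin type.
type Plugin struct{ Name string }

// NodeState is a hypothetical, simplified stand-in for SriovNetworkNodeState.
// An empty Vendors slice models a status that has not been populated yet.
type NodeState struct {
	Vendors []string
}

// enablePlugins imitates the behavior described in the issue: the two base
// plugins are always registered, vendor plugins only if the status lists them.
func enablePlugins(state NodeState) map[string]Plugin {
	plugins := map[string]Plugin{
		"k8s_plugin":     {Name: "k8s_plugin"},
		"generic_plugin": {Name: "generic_plugin"},
	}
	for _, vendor := range state.Vendors {
		name := vendor + "_plugin"
		plugins[name] = Plugin{Name: name}
	}
	return plugins
}

// syncCountOnly reproduces the guard from the issue: plugins are loaded only
// when the map is empty, so a later state change is never taken into account.
func syncCountOnly(enabled map[string]Plugin, state NodeState) map[string]Plugin {
	if len(enabled) == 0 {
		return enablePlugins(state)
	}
	return enabled
}

func main() {
	// First sync: the status is still empty, so only the base plugins load.
	enabled := syncCountOnly(nil, NodeState{})

	// Second sync: the status now reports a Mellanox NIC, but the guard is
	// skipped because the map already holds two entries.
	enabled = syncCountOnly(enabled, NodeState{Vendors: []string{"mellanox"}})

	_, ok := enabled["mellanox_plugin"]
	fmt.Println(len(enabled), ok) // prints "2 false"
}
```

The second call returns the map unchanged, which is the behavior reported here: once the base plugins are present, nothing re-evaluates the vendor plugins.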
A possible fix could look like this code template:
// load plugins if has not loaded
if len(dn.enabledPlugins) == 0 {
	dn.enabledPlugins, err = enablePlugins(dn.platform, dn.useSystemdService, latestState, dn.hostManager, dn.storeManager)
	if err != nil {
		log.Log.Error(err, "nodeStateSyncHandler(): failed to enable vendor plugins")
		return err
	}
} else {
	// load mellanox vendor plugin if has not loaded before
	if _, exist := dn.enabledPlugins[mellanox.PluginName]; !exist {
		vendorPlugins, err := registerVendorPlugins(latestState)
		if err != nil {
			log.Log.Error(err, "nodeStateSyncHandler(): failed to enable mellanox vendor plugins error")
			return err
		}
		if _, exist := vendorPlugins[mellanox.PluginName]; !exist {
			glog.V(0).Info("nodeStateSyncHandler(): mellanox plugin NOT registered.")
		} else {
			dn.enabledPlugins[mellanox.PluginName] = vendorPlugins[mellanox.PluginName]
			glog.V(0).Info("nodeStateSyncHandler(): mellanox plugin registered.")
		}
	}
}
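The template above special-cases the Mellanox plugin. As a rough, vendor-agnostic variation on the same idea, the name check can be written as a merge that adds any discovered plugin whose name is not yet enabled. This is only a sketch: `Plugin` and `mergeMissing` are hypothetical names and do not exist in the operator.

```go
package main

import (
	"fmt"
	"sort"
)

// Plugin is a hypothetical stand-in for the operator's plugin type.
type Plugin struct{ Name string }

// mergeMissing adds every discovered plugin whose name is not yet present in
// enabled and returns the sorted names it added. Plugins that are already
// enabled are left untouched. enabled must be a non-nil map.
func mergeMissing(enabled, discovered map[string]Plugin) []string {
	var added []string
	for name, p := range discovered {
		if _, ok := enabled[name]; !ok {
			enabled[name] = p
			added = append(added, name)
		}
	}
	sort.Strings(added) // map iteration order is random, so sort for stable output
	return added
}

func main() {
	enabled := map[string]Plugin{
		"k8s_plugin":     {Name: "k8s_plugin"},
		"generic_plugin": {Name: "generic_plugin"},
	}
	// Plugins discovered on a later sync, once the status is populated.
	discovered := map[string]Plugin{
		"generic_plugin":  {Name: "generic_plugin"},
		"mellanox_plugin": {Name: "mellanox_plugin"},
	}

	fmt.Println(mergeMissing(enabled, discovered)) // prints "[mellanox_plugin]"
	fmt.Println(mergeMissing(enabled, discovered)) // prints "[]": nothing left to add
}
```

Checking by name makes the operation idempotent, so it could run on every sync without re-registering plugins that are already loaded.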
I don't think that is the case, because we always run pollNic before we get here, no?
Corrected: the code is from release 1.2.0: https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/v1.2.0/pkg/daemon/daemon.go#L426
It is triggered when SriovNetworkNodeState changes, and it is handled by the nodeStateSyncHandler function.
We actually encountered this situation in a production environment.
Will you be able to use a newer version of the operator? :)
https://github.com/k8snetworkplumbingwg/sriov-network-operator/releases
The latest release is v1.2.0, so should we deploy the master branch to our production environment?
You are right. We are going to do a new release of the operator soon; sorry about that.
We plan to use the sriov-network-operator 1.3.0 release in our production environment, and we'll keep watching this bug.
Thanks a lot.
Hi @qiaoning, please check the latest release; it also fixes a critical issue in sriov-cni.