Mellanox / k8s-rdma-shared-dev-plugin

[bug] RDMA shared device plugin should fail if a resource does not have all RDMA subresources required

adrianchiris opened this issue

Today, the plugin just passes whatever subset of /dev/infiniband files it finds, which is unusable for an RDMA workload.
There should be a check that ensures all core RDMA resources are available, and the plugin should fail otherwise.
This may happen if:

  1. only a subset of the RDMA-related kmods were loaded
  2. the driver is being reloaded at the time the device plugin loads on system boot, in a k8s setup where a MOFED driver container is used.
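
A minimal sketch of such a check, in Go. The issue does not list which files count as "core" RDMA resources, so the `requiredPrefixes` set below (`uverbs`, `umad`, `rdma_cm`) is an assumption, and `missingRdmaResources` is a hypothetical helper, not a function from the plugin. It also looks at /dev/infiniband as a whole rather than mapping files to a specific device, which a real check would need to do.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredPrefixes is an ASSUMED set of "core" RDMA character devices.
// The issue does not enumerate them; adjust to what the workload needs.
var requiredPrefixes = []string{"uverbs", "umad", "rdma_cm"}

// missingRdmaResources returns every required prefix that no name in
// present starts with. An empty result means nothing is missing.
func missingRdmaResources(present, required []string) []string {
	var missing []string
	for _, prefix := range required {
		found := false
		for _, name := range present {
			if strings.HasPrefix(name, prefix) {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, prefix)
		}
	}
	return missing
}

func main() {
	entries, err := os.ReadDir("/dev/infiniband")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /dev/infiniband:", err)
		os.Exit(1)
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	// Fail instead of exposing a partial, unusable set of files.
	if missing := missingRdmaResources(names, requiredPrefixes); len(missing) > 0 {
		fmt.Fprintln(os.Stderr, "missing RDMA resources:", missing)
		os.Exit(1)
	}
	fmt.Println("all required RDMA resources present")
}
```

Keeping the comparison separate from the directory read makes the rule easy to exercise without RDMA hardware.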

@moshe010 Is there a case where it's OK for the rdma-shared dp to report 0 resources for a specific resource?

What if I define resources for both ETH and IB on the same physical device, then change the link type?
(when the selector is based on netdev name)

I'm thinking of addressing this with the following behaviour:

  1. if a device is found with missing RDMA resources, fail the device plugin
  2. if a device is not found, skip it
  3. have a goroutine running to monitor the resource and update kubelet.
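
The three steps above could be sketched roughly as follows. Everything here is illustrative: the `device` type, `usableDevices`, and `monitor` are hypothetical names, not the plugin's API, and the channel stands in for whatever mechanism reports updates to kubelet. What to do when a device becomes incomplete at runtime is not specified in this thread; the sketch assumes it logs the error and keeps the last reported state.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// device is a hypothetical view of one configured device.
type device struct {
	Name    string
	Found   bool     // whether the selector matched a device on the host
	Missing []string // RDMA resources the device lacks
}

var errIncomplete = errors.New("device is missing RDMA resources")

// usableDevices applies steps 1 and 2: devices that were not found are
// skipped, and a device that was found but is incomplete is an error.
func usableDevices(devs []device) ([]device, error) {
	var usable []device
	for _, d := range devs {
		if !d.Found {
			continue // step 2: not found, skip
		}
		if len(d.Missing) > 0 {
			// step 1: found but incomplete, fail
			return nil, fmt.Errorf("%s: %w: %v", d.Name, errIncomplete, d.Missing)
		}
		usable = append(usable, d)
	}
	return usable, nil
}

// monitor is step 3: rescan periodically and publish the usable set.
func monitor(ctx context.Context, interval time.Duration, scan func() []device, updates chan<- []device) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			usable, err := usableDevices(scan())
			if err != nil {
				// ASSUMPTION: at runtime, log and keep the last state.
				log.Printf("rescan failed: %v", err)
				continue
			}
			select {
			case updates <- usable:
			case <-ctx.Done():
				return
			}
		}
	}
}

func main() {
	// Fake scan result standing in for real device discovery.
	scan := func() []device {
		return []device{{Name: "ib0", Found: true}, {Name: "eth1", Found: false}}
	}
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	updates := make(chan []device)
	go monitor(ctx, 100*time.Millisecond, scan, updates)
	for {
		select {
		case devs := <-updates:
			fmt.Println("usable devices:", len(devs))
		case <-ctx.Done():
			return
		}
	}
}
```

With this split, a resource whose device is absent (for example after a link type change) simply yields an empty list, while a half-initialized device surfaces as an error.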

Addressed by PR#24 and PR#26.