kubevirt / hyperconverged-cluster-operator

Operator pattern for managing multi-operator products

Nvidia GPU does not seem to be recognized on OKD 4.5 cluster for use with KubeVirt.

rupang790 opened this issue

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
The Nvidia GPU is not recognized on an OKD 4.5 cluster (kubevirt-hyperconverged 1.4.0-unstable).
I previously opened an issue about KubeVirt with a GPU, which I had tested on Kubernetes, and the 1.4.0-unstable operator was updated as a result:
#1227 (comment)

However, when I installed kubevirt-gpu-dp on the OKD 4.5 cluster in the same way I tested on Kubernetes (configuring the kernel command line and binding the device to vfio-pci), the GPU device does not seem to be recognized.
I am not sure whether the issue lies with KubeVirt, NVIDIA, or the Kubernetes kubelet, so I would like to share this issue with all of you.
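
For reference, the kernel and vfio-pci preparation I mean is roughly the following. This is only a sketch, not my exact files: the MachineConfig name is illustrative, and intel_iommu=on assumes an Intel host (use amd_iommu=on on AMD).

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 100-worker-vfio-gpu        # illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  - intel_iommu=on                 # enable the IOMMU (amd_iommu=on on AMD hosts)
  - vfio-pci.ids=10de:1eb8         # claim the Tesla T4 with vfio-pci at boot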

** The nvidia.com/gpu resource in the command output below is advertised by the gpu-operator, and the GPU did work with the gpu-operator as well.

• From Worker

[core@worker01 ~]$ sudo lspci -nnk -d 10de:
86:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:12a2]
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

• From Bastion

[root@okd-bastion01 gpu]# oc get node worker01.eluon.okd.com  -o json | jq '.status.allocatable'
{
  "cpu": "55500m",
  "devices.kubevirt.io/kvm": "110",
  "devices.kubevirt.io/tun": "110",
  "devices.kubevirt.io/vhost-net": "110",
  "ephemeral-storage": "1078621800299",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "130572740Ki",
  "nvidia.com/gpu": "0", 
  "pods": "250"
}

[root@okd-bastion01 gpu]# oc -n kubevirt-hyperconverged edit hyperconverged
...
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  featureGates:
    sriovLiveMigration: false
    withHostPassthroughCPU: false
  infra: {}
  liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
  permittedHostDevices:
    pciHostDevices:
    - externalResourceProvider: true
      pciVendorSelector: 10DE:1EB8
      resourceName: nvidia.com/TU104GL_Tesla_T4
  version: 1.4.0
  workloads: {}
...
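
For completeness, the way I expect to consume this device from a VM is via spec.domain.devices.gpus, referencing the same resourceName. A minimal sketch (the VMI name and container disk image are only examples), assuming the nvidia.com/TU104GL_Tesla_T4 resource is actually advertised on the node:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-test-vmi                 # illustrative name
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      gpus:
      - name: gpu1
        deviceName: nvidia.com/TU104GL_Tesla_T4   # must match the resourceName above
    resources:
      requests:
        memory: 4Gi
  volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/kubevirt/fedora-cloud-container-disk-demo   # example image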


What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • HCO version (use oc get csv -n kubevirt-hyperconverged): 1.4.0-unstable
  • Kubernetes version (use kubectl version): Client: v1.18.2-0-g52c56ce / Server: v1.18.3
  • Cloud provider or hardware configuration:
  • Install tools: git clone and oc create -f
  • Others:

@rupang790 can you please share the output of the following?

oc get kubevirt -n kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged -o yaml

On my env I see:

$ oc get hco -n kubevirt-hyperconverged   kubevirt-hyperconverged -o json | jq '.spec.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}
$ oc get kubevirt -n kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}

So I think that the configuration mechanism in the operator is working as designed.
Unfortunately I don't have that device to test it end to end.
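
Since externalResourceProvider: true means KubeVirt leaves allocation of that resource to the external device plugin, it would also be worth checking whether the node actually advertises the named resource, with something like:

$ oc get node worker01.eluon.okd.com -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com")))'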

@tiraboschi the outputs of commands are below:

[root@okd-bastion01 ~]# oc get hco -n kubevirt-hyperconverged   kubevirt-hyperconverged -o json | jq '.spec.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}
[root@okd-bastion01 ~]# oc get kubevirt -n kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}

That means KubeVirt itself is working as designed, right? So I suspect it is related to the kubelet.
Which version of OKD or OpenShift is your cluster running? I think the issue comes from a difference in kubelet versions,
because I tested this on native Kubernetes, which uses version 1.20, and it worked well there.

@tiraboschi, actually, after changing SELinux to permissive mode on the GPU worker node, the GPU is recognized.
Thank you.
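
For anyone hitting the same problem, this is roughly how I switched the node to permissive mode (a temporary, non-persistent sketch; a MachineConfig or a proper SELinux policy for the device plugin would be the cleaner long-term fix):

# temporarily set SELinux to permissive on the GPU worker (does not survive a reboot)
oc debug node/worker01.eluon.okd.com -- chroot /host setenforce 0
# verify
oc debug node/worker01.eluon.okd.com -- chroot /host getenforce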