kubevirt / hyperconverged-cluster-operator

Operator pattern for managing multi-operator products

Nvidia GPU does not seem to be recognized on OKD 4.5 cluster for use with KubeVirt.

rupang790 opened this issue

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
The Nvidia GPU is not recognized on an OKD 4.5 cluster (kubevirt-hyperconverged 1.4.0-unstable).
I previously opened an issue about KubeVirt with a GPU, which I had tested on Kubernetes, and the 1.4.0-unstable operator was updated as a result:
#1227 (comment)

However, when I installed kubevirt-gpu-dp on the OKD 4.5 cluster in the same way I tested on Kubernetes (configuring the kernel command line and binding the device to vfio-pci), the GPU device does not seem to be recognized.
I am not sure whether the issue lies with KubeVirt, NVIDIA, or the Kubernetes kubelet, so I would like to share this issue with all of you.
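
For reference, the kernel and vfio-pci preparation I mean is roughly the following. This is only a sketch, not my exact files: the MachineConfig name is illustrative, and intel_iommu=on assumes an Intel host (use amd_iommu=on on AMD).

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 100-worker-vfio-gpu        # illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  - intel_iommu=on                 # enable the IOMMU (amd_iommu=on on AMD hosts)
  - vfio-pci.ids=10de:1eb8         # claim the Tesla T4 with vfio-pci at boot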

** The nvidia.com/gpu resource in the command output below is advertised by the gpu-operator, and the GPU did work with the gpu-operator as well.

• From Worker

[core@worker01 ~]$ sudo lspci -nnk -d 10de:
86:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:12a2]
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

• From Bastion

[root@okd-bastion01 gpu]# oc get node worker01.eluon.okd.com  -o json | jq '.status.allocatable'
{
  "cpu": "55500m",
  "devices.kubevirt.io/kvm": "110",
  "devices.kubevirt.io/tun": "110",
  "devices.kubevirt.io/vhost-net": "110",
  "ephemeral-storage": "1078621800299",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "130572740Ki",
  "nvidia.com/gpu": "0", 
  "pods": "250"
}

[root@okd-bastion01 gpu]# oc -n kubevirt-hyperconverged edit hyperconverged
...
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  featureGates:
    sriovLiveMigration: false
    withHostPassthroughCPU: false
  infra: {}
  liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
  permittedHostDevices:
    pciHostDevices:
    - externalResourceProvider: true
      pciVendorSelector: 10DE:1EB8
      resourceName: nvidia.com/TU104GL_Tesla_T4
  version: 1.4.0
  workloads: {}
...
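
For completeness, the way I expect to consume this device from a VM is via spec.domain.devices.gpus, referencing the same resourceName. A minimal sketch (the VMI name and container disk image are only examples), assuming the nvidia.com/TU104GL_Tesla_T4 resource is actually advertised on the node:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-test-vmi                 # illustrative name
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      gpus:
      - name: gpu1
        deviceName: nvidia.com/TU104GL_Tesla_T4   # must match the resourceName above
    resources:
      requests:
        memory: 4Gi
  volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/kubevirt/fedora-cloud-container-disk-demo   # example image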


What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • HCO version (use oc get csv -n kubevirt-hyperconverged): 1.4.0-unstable
  • Kubernetes version (use kubectl version): Client: v1.18.2-0-g52c56ce / Server: v1.18.3
  • Cloud provider or hardware configuration:
  • Install tools: git clone and oc create -f
  • Others:

@rupang790 can you please share the output of the following?

oc get kubevirt -n kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged -o yaml

On my env I see:

$ oc get hco -n kubevirt-hyperconverged   kubevirt-hyperconverged -o json | jq '.spec.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}
$ oc get kubevirt -n kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}

So I think that the configuration mechanism in the operator is working as designed.
Unfortunately I don't have that device to test it end to end.
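
Since externalResourceProvider: true means KubeVirt leaves allocation of that resource to the external device plugin, it would also be worth checking whether the node actually advertises the named resource, with something like:

$ oc get node worker01.eluon.okd.com -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com")))'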

@tiraboschi the outputs of commands are below:

[root@okd-bastion01 ~]# oc get hco -n kubevirt-hyperconverged   kubevirt-hyperconverged -o json | jq '.spec.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}
[root@okd-bastion01 ~]# oc get kubevirt -n kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.permittedHostDevices'
{
  "pciHostDevices": [
    {
      "externalResourceProvider": true,
      "pciVendorSelector": "10DE:1EB8",
      "resourceName": "nvidia.com/TU104GL_Tesla_T4"
    }
  ]
}

That means KubeVirt itself is working as designed, right? So I suspect it is related to the kubelet.
Which version of OKD or OpenShift is your cluster running? I think the issue comes from a difference in kubelet versions,
because I tested this on native Kubernetes, which uses version 1.20, and it worked well there.

@tiraboschi, actually, after changing SELinux to permissive mode on the GPU worker node, the GPU is recognized.
Thank you.
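
For anyone hitting the same problem, this is roughly how I switched the node to permissive mode (a temporary, non-persistent sketch; a MachineConfig or a proper SELinux policy for the device plugin would be the cleaner long-term fix):

# temporarily set SELinux to permissive on the GPU worker (does not survive a reboot)
oc debug node/worker01.eluon.okd.com -- chroot /host setenforce 0
# verify
oc debug node/worker01.eluon.okd.com -- chroot /host getenforce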