Nvidia GPU seems not to be recognized on OKD 4.5 cluster for use with KubeVirt.
rupang790 opened this issue · comments
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
The Nvidia GPU is not recognized on an OKD 4.5 cluster (kubevirt-hyperconverged 1.4.0-unstable).
Previously I filed an issue about KubeVirt with a GPU, which I tested on Kubernetes, and the 1.4.0-unstable operator was updated:
#1227 (comment)
However, when I tried to install kubevirt-gpu-dp on the OKD 4.5 cluster in the same way I tested on Kubernetes (configuring the kernel command line and binding the device to vfio-pci), the GPU device does not seem to be recognized.
I am not sure whether the issue is with KubeVirt, Nvidia, or the Kubernetes kubelet, so I would like to share it with all of you.
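For reference, the "configure kernel command and VFIO-PCI" step can be expressed on OKD as a MachineConfig with kernel arguments; a rough sketch of that approach (the object name is illustrative, and `intel_iommu=on` assumes an Intel host):

```yaml
# Sketch: enable the IOMMU and bind the Tesla T4 (10de:1eb8) to vfio-pci
# at boot on all worker nodes. Apply with `oc create -f`.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 100-worker-vfio-pci
spec:
  kernelArguments:
    - intel_iommu=on
    - vfio-pci.ids=10de:1eb8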
**`nvidia.com/gpu`** in the results of the commands below is registered by the gpu-operator, and the device also worked with the gpu-operator.
From the worker:
[core@worker01 ~]$ sudo lspci -nnk -d 10de:
86:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12a2]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
From the bastion:
[root@okd-bastion01 gpu]# oc get node worker01.eluon.okd.com -o json | jq '.status.allocatable'
{
"cpu": "55500m",
"devices.kubevirt.io/kvm": "110",
"devices.kubevirt.io/tun": "110",
"devices.kubevirt.io/vhost-net": "110",
"ephemeral-storage": "1078621800299",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "130572740Ki",
"nvidia.com/gpu": "0",
"pods": "250"
}
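The allocatable map above shows the symptom: `nvidia.com/gpu` is present but its quantity is `0`, so no GPU is actually schedulable. A small sketch of that check (the helper name is mine, operating on the JSON printed by `oc get node ... -o json | jq '.status.allocatable'`):

```python
import json

def gpu_resources(allocatable: dict) -> dict:
    """Return the nvidia.com/* entries that have a non-zero quantity."""
    return {
        name: qty
        for name, qty in allocatable.items()
        if name.startswith("nvidia.com/") and qty not in ("0", 0)
    }

# Abbreviated copy of the worker's allocatable map from above.
allocatable = json.loads("""
{
  "cpu": "55500m",
  "devices.kubevirt.io/kvm": "110",
  "nvidia.com/gpu": "0",
  "pods": "250"
}
""")
print(gpu_resources(allocatable))  # prints {} — no usable GPU advertised
```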
[root@okd-bastion01 gpu]# oc -n kubevirt-hyperconverged edit hyperconverged
...
spec:
certConfig:
ca:
duration: 48h0m0s
renewBefore: 24h0m0s
server:
duration: 24h0m0s
renewBefore: 12h0m0s
featureGates:
sriovLiveMigration: false
withHostPassthroughCPU: false
infra: {}
liveMigrationConfig:
bandwidthPerMigration: 64Mi
completionTimeoutPerGiB: 800
parallelMigrationsPerCluster: 5
parallelOutboundMigrationsPerNode: 2
progressTimeout: 150
permittedHostDevices:
pciHostDevices:
- externalResourceProvider: true
pciVendorSelector: 10DE:1EB8
resourceName: nvidia.com/TU104GL_Tesla_T4
version: 1.4.0
workloads: {}
...
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- HCO version (use `oc get csv -n kubevirt-hyperconverged`): 1.4.0-unstable
- Kubernetes version (use `kubectl version`): Client: v1.18.2-0-g52c56ce / Server: v1.18.3
- Cloud provider or hardware configuration:
- Install tools: git clone and oc create -f
- Others:
@rupang790 can you please share the output of:
oc get kubevirt -n kubevirt-hyperconverged kubevirt-kubevirt-hyperconverged -o yaml
On my env I see:
$ oc get hco -n kubevirt-hyperconverged kubevirt-hyperconverged -o json | jq '.spec.permittedHostDevices'
{
"pciHostDevices": [
{
"externalResourceProvider": true,
"pciVendorSelector": "10DE:1EB8",
"resourceName": "nvidia.com/TU104GL_Tesla_T4"
}
]
}
$ oc get kubevirt -n kubevirt-hyperconverged kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.permittedHostDevices'
{
"pciHostDevices": [
{
"externalResourceProvider": true,
"pciVendorSelector": "10DE:1EB8",
"resourceName": "nvidia.com/TU104GL_Tesla_T4"
}
]
}
So I think that the configuration mechanism in the operator is working as designed.
Unfortunately I don't have that device to test it end to end.
@tiraboschi the outputs of commands are below:
[root@okd-bastion01 ~]# oc get hco -n kubevirt-hyperconverged kubevirt-hyperconverged -o json | jq '.spec.permittedHostDevices'
{
"pciHostDevices": [
{
"externalResourceProvider": true,
"pciVendorSelector": "10DE:1EB8",
"resourceName": "nvidia.com/TU104GL_Tesla_T4"
}
]
}
[root@okd-bastion01 ~]# oc get kubevirt -n kubevirt-hyperconverged kubevirt-kubevirt-hyperconverged -o json | jq '.spec.configuration.permittedHostDevices'
{
"pciHostDevices": [
{
"externalResourceProvider": true,
"pciVendorSelector": "10DE:1EB8",
"resourceName": "nvidia.com/TU104GL_Tesla_T4"
}
]
}
Does that mean KubeVirt is working as designed? If so, maybe the problem is related to the kubelet.
What version of OKD or OpenShift does your cluster use? I suspect the issue comes from a kubelet version difference, because I tested this on native Kubernetes 1.20 and it worked well.
@tiraboschi, actually, after changing SELinux to permissive mode on the GPU worker node, the GPU is recognized.
Thank you.
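For anyone hitting the same thing: the quick test is `sudo setenforce 0` on the node, but that does not survive a reboot. One way to persist permissive mode on OKD is a MachineConfig with the `enforcing=0` kernel argument (a sketch; the object name is illustrative). Note that permissive mode is a workaround, not a fix — the proper solution is an SELinux policy that allows the device plugin's access.

```yaml
# Sketch: boot all worker nodes with SELinux in permissive mode.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 05-worker-selinux-permissive
spec:
  kernelArguments:
    - enforcing=0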