Troubleshooting VirtualGL with NVIDIA GPU Operator in EKS
Mohamed-ben-khemis opened this issue
Issue Summary
VirtualGL fails to detect GPUs within my EKS (Amazon Elastic Kubernetes Service) cluster, which uses the NVIDIA GPU Operator. Although nvidia-smi confirms that the GPU is present, running glxgears with GPU acceleration via vglrun produces the following error:
vglrun -d /dev/nvidia0 glxgears
[VGL] ERROR: in init3D--
[VGL] 228: Invalid EGL device
Details
- EKS Instance Type: g4dn.xlarge
- GPU: Tesla T4
- NVIDIA GPU Operator Helm Chart: v23.9.2
- Container Environment: Kubernetes with NVIDIA GPU Operator
- Command Used:
vglrun -d /dev/nvidia0 glxgears
Issue
VirtualGL (vglrun) fails to initialize the 3D environment for glxgears with an "Invalid EGL device" error when attempting GPU acceleration.
Questions
- How can I troubleshoot and resolve the issue of VirtualGL failing to detect and utilize GPUs within my container environment?
- Are there additional configurations or dependencies required to enable GPU acceleration with VirtualGL on EKS using the NVIDIA GPU Operator?
Additional Information
- Output of nvidia-smi within the container confirms GPU presence and functionality:
@ubuntu-fk5a8-91b4d208t9nxv:/etc/X11/xorg.conf.d$ nvidia-smi
Fri May 3 11:14:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 25C P8 14W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
- Dockerfile snippet supporting NVIDIA GPUs for graphics acceleration:
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
COPY src/install/nvidia/10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json
ENV DEBIAN_FRONTEND=noninteractive \
INST_SCRIPTS=/dockerstartup/install \
LANG=$LANG \
LANGUAGE=$LANGUAGE \
LC_ALL=$LC_ALL \
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/i386-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 \
NVIDIA_DRIVER_CAPABILITIES=all
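One quick sanity check that this loader configuration actually took effect is to look for the NVIDIA EGL vendor library from inside the container. A minimal sketch, assuming the library directories from the Dockerfile above (the helper name is mine; adjust the paths if your driver libraries are mounted elsewhere):

```python
import os

# Directories the Dockerfile registers with the dynamic linker; these are
# assumptions taken from the snippet above, not fixed VirtualGL requirements.
NVIDIA_LIB_DIRS = ("/usr/local/nvidia/lib", "/usr/local/nvidia/lib64")

def egl_vendor_visible(dirs=NVIDIA_LIB_DIRS, lib="libEGL_nvidia.so.0"):
    """Return True if the NVIDIA EGL vendor library exists in any of the
    given directories. If this is False inside the container, glvnd cannot
    dispatch EGL calls to the NVIDIA driver, and VirtualGL's EGL back end
    will fail regardless of the -d argument."""
    return any(os.path.exists(os.path.join(d, lib)) for d in dirs)
```

If this returns False in the running pod, the problem is upstream of VirtualGL: the GPU Operator's container toolkit did not inject the driver's graphics libraries into the container.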
- 10_nvidia.json:
{
"file_format_version" : "1.0.0",
"ICD" : {
"library_path" : "libEGL_nvidia.so.0"
}
}
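For context, this file is a glvnd EGL vendor ICD: it tells the GL Vendor-Neutral Dispatch library which vendor library to load for EGL calls. A small hedged sketch for sanity-checking such a file (the helper name is mine, not part of glvnd):

```python
import json

def check_egl_icd(text):
    """Parse a glvnd EGL vendor ICD file and return the vendor library it
    points at, raising if the fields glvnd relies on are missing."""
    icd = json.loads(text)
    assert "file_format_version" in icd, "missing file_format_version"
    lib = icd["ICD"]["library_path"]
    assert lib.startswith("libEGL_"), f"unexpected vendor library: {lib}"
    return lib

# The JSON from this issue:
sample = '{"file_format_version": "1.0.0", "ICD": {"library_path": "libEGL_nvidia.so.0"}}'
# check_egl_icd(sample) returns "libEGL_nvidia.so.0"
```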
- terraform installation:
/********************************************
GPU Operator Configuration
********************************************/
resource "helm_release" "gpu_operator" {
count = length(data.aws_instances.nodes) > 0 ? 1 : 0
name = "gpu-operator"
repository = "https://helm.ngc.nvidia.com/nvidia"
chart = "gpu-operator"
version = var.nvaie ? var.nvaie_gpu_operator_version : var.gpu_operator_version
namespace = var.gpu_operator_namespace
create_namespace = true
atomic = true
cleanup_on_fail = true
reset_values = true
replace = true
set {
name = "driver.version"
value = var.nvaie ? var.nvaie_gpu_operator_driver_version : var.gpu_operator_driver_version
}
}
- Pod Spec:
apiVersion: v1
kind: Pod
metadata:
name: ubuntu-fk5a8-91b4d208t9nxv
namespace: test6q55c
spec:
containers:
- args:
- --display-addr
- unix:///var/run/project/display.sock
- --user-id
- "9000"
- --pulse-server
- /run/user/9000/pulse/native
image: ****/proxy:v1.0.156
imagePullPolicy: IfNotPresent
name: project-proxy
ports:
- containerPort: 8443
name: web
protocol: TCP
resources:
requests:
cpu: 150m
memory: 50M
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tmp
name: tmp
- mountPath: /run
name: run
- mountPath: /run/lock
name: run-lock
- mountPath: /etc/project/tls/server
name: tls
readOnly: true
- mountPath: /var/run/project
name: vnc-sock
- mountPath: /mnt/home
name: home
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-wnppn
readOnly: true
- env:
- name: USER
value: project
- name: UID
value: "9000"
- name: HOME
value: /home/project
- name: LANG
value: en_US.UTF-8
- name: LANGUAGE
value: en_US:en
- name: DISPLAY_SOCK_ADDR
value: /var/run/project/display.sock
- name: ENABLE_ROOT
value: "true"
image: ****/ubuntu-xfce4:v1.0.0
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- kill
- -s
- SIGRTMIN+3
- "1"
name: desktop
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
securityContext:
capabilities:
add:
- SYS_ADMIN
privileged: true
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /run
name: run
- mountPath: /run/lock
name: run-lock
- mountPath: /dev/shm
name: shm
- mountPath: /home/user
name: home
- mountPath: /tmp
name: tmp
- mountPath: /var/run/user
name: vnc-sock
- mountPath: /sys/fs/cgroup
name: cgroups
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-wnppn
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 9000
runAsNonRoot: false
serviceAccount: default
serviceAccountName: default
subdomain: ubuntu-fk5a8-91b4d208t9nxv
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
volumes:
- emptyDir: {}
name: run
- emptyDir: {}
name: run-lock
- hostPath:
path: /dev/shm
type: ""
name: shm
- name: tls
secret:
defaultMode: 420
secretName: ubuntu-fk5a8-91b4d208t9nxv
- emptyDir: {}
name: tmp
- emptyDir: {}
name: vnc-sock
- emptyDir: {}
name: home
- hostPath:
path: /sys/fs/cgroup
type: ""
name: cgroups
- name: kube-api-access-wnppn
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
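One thing worth double-checking in a spec like this: with the NVIDIA container stack, the driver's graphics libraries (including libEGL_nvidia.so.0) are only injected when the driver capabilities include graphics (or all). The image above already sets NVIDIA_DRIVER_CAPABILITIES=all at build time; a hedged, pod-level equivalent for the desktop container's env list, redundant if the image's ENV survives but explicit, would be:

```yaml
# Assumed addition to the desktop container's env list.
- name: NVIDIA_DRIVER_CAPABILITIES
  value: all
```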
As specified in the VirtualGL User's Guide, the -d argument should be a DRI device path (/dev/dri/card[0-9]) or an EGL device name (egl[0-9]). /dev/nvidia0 is neither, which is why VirtualGL reports "Invalid EGL device".
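Putting that together, the device argument can be validated before launching. A minimal sketch (the helper is hypothetical, but the accepted -d forms follow the User's Guide):

```python
import re

# Forms the VirtualGL User's Guide accepts for -d: a DRI device path
# (/dev/dri/card[0-9]) or an EGL device name (egl[0-9]).
VALID_DEVICE = re.compile(r"^(/dev/dri/card[0-9]+|egl[0-9]+)$")

def vglrun_cmd(device, app="glxgears"):
    """Build a vglrun command line, rejecting back-end devices that would
    trigger VirtualGL's "Invalid EGL device" error (e.g. /dev/nvidia0)."""
    if not VALID_DEVICE.match(device):
        raise ValueError(f"{device!r} is not a DRI path or EGL device name")
    return ["vglrun", "-d", device, app]

# The command from this issue, vglrun_cmd("/dev/nvidia0"), raises ValueError,
# while vglrun_cmd("/dev/dri/card0") builds a command VirtualGL will accept.
```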