Troubleshooting VirtualGL with NVIDIA GPU Operator in EKS
Mohamed-ben-khemis opened this issue
Issue Summary
VirtualGL fails to detect GPUs within my EKS (Amazon Elastic Kubernetes Service) cluster, which uses the NVIDIA GPU Operator. Although nvidia-smi confirms that the GPU is present, running glxgears with GPU acceleration via vglrun produces the following error:
vglrun -d /dev/nvidia0 glxgears
[VGL] ERROR: in init3D--
[VGL] 228: Invalid EGL device
Details
- EKS Instance Type: g4dn.xlarge
- GPU: Tesla T4
- NVIDIA GPU Operator Helm Chart: v23.9.2
- Container Environment: Kubernetes with NVIDIA GPU Operator
- Command Used:
vglrun -d /dev/nvidia0 glxgears
Issue
VirtualGL (vglrun) fails to initialize the 3D environment for glxgears with an "Invalid EGL device" error when attempting GPU acceleration.
Questions
- How can I troubleshoot and resolve the issue of VirtualGL failing to detect and utilize GPUs within my container environment?
- Are there additional configurations or dependencies required to enable GPU acceleration with VirtualGL on EKS using the NVIDIA GPU Operator?
Additional Information
- Output of nvidia-smi within the container confirms GPU presence and functionality:
@ubuntu-fk5a8-91b4d208t9nxv:/etc/X11/xorg.conf.d$ nvidia-smi
Fri May 3 11:14:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 25C P8 14W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
- Dockerfile snippet supporting NVIDIA GPUs for graphics acceleration:
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
COPY src/install/nvidia/10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json
ENV DEBIAN_FRONTEND=noninteractive \
INST_SCRIPTS=/dockerstartup/install \
LANG=$LANG \
LANGUAGE=$LANGUAGE \
LC_ALL=$LC_ALL \
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/i386-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 \
NVIDIA_DRIVER_CAPABILITIES=all
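One quick sanity check that this loader configuration actually took effect is to look for the NVIDIA EGL vendor library from inside the container. A minimal sketch, assuming the library directories from the Dockerfile above (the helper name is mine; adjust the paths if your driver libraries are mounted elsewhere):

```python
import os

# Directories the Dockerfile registers with the dynamic linker; these are
# assumptions taken from the snippet above, not fixed VirtualGL requirements.
NVIDIA_LIB_DIRS = ("/usr/local/nvidia/lib", "/usr/local/nvidia/lib64")

def egl_vendor_visible(dirs=NVIDIA_LIB_DIRS, lib="libEGL_nvidia.so.0"):
    """Return True if the NVIDIA EGL vendor library exists in any of the
    given directories. If this is False inside the container, glvnd cannot
    dispatch EGL calls to the NVIDIA driver, and VirtualGL's EGL back end
    will fail regardless of the -d argument."""
    return any(os.path.exists(os.path.join(d, lib)) for d in dirs)
```

If this returns False in the running pod, the problem is upstream of VirtualGL: the GPU Operator's container toolkit did not inject the driver's graphics libraries into the container.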
- 10_nvidia.json:
{
"file_format_version" : "1.0.0",
"ICD" : {
"library_path" : "libEGL_nvidia.so.0"
}
}
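For context, this file is a glvnd EGL vendor ICD: it tells the GL Vendor-Neutral Dispatch library which vendor library to load for EGL calls. A small hedged sketch for sanity-checking such a file (the helper name is mine, not part of glvnd):

```python
import json

def check_egl_icd(text):
    """Parse a glvnd EGL vendor ICD file and return the vendor library it
    points at, raising if the fields glvnd relies on are missing."""
    icd = json.loads(text)
    assert "file_format_version" in icd, "missing file_format_version"
    lib = icd["ICD"]["library_path"]
    assert lib.startswith("libEGL_"), f"unexpected vendor library: {lib}"
    return lib

# The JSON from this issue:
sample = '{"file_format_version": "1.0.0", "ICD": {"library_path": "libEGL_nvidia.so.0"}}'
# check_egl_icd(sample) returns "libEGL_nvidia.so.0"
```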
- terraform installation:
/********************************************
GPU Operator Configuration
********************************************/
resource "helm_release" "gpu_operator" {
count = length(data.aws_instances.nodes) > 0 ? 1 : 0
name = "gpu-operator"
repository = "https://helm.ngc.nvidia.com/nvidia"
chart = "gpu-operator"
version = var.nvaie ? var.nvaie_gpu_operator_version : var.gpu_operator_version
namespace = var.gpu_operator_namespace
create_namespace = true
atomic = true
cleanup_on_fail = true
reset_values = true
replace = true
set {
name = "driver.version"
value = var.nvaie ? var.nvaie_gpu_operator_driver_version : var.gpu_operator_driver_version
}
}
- Pod Spec:
apiVersion: v1
kind: Pod
metadata:
name: ubuntu-fk5a8-91b4d208t9nxv
namespace: test6q55c
spec:
containers:
- args:
- --display-addr
- unix:///var/run/project/display.sock
- --user-id
- "9000"
- --pulse-server
- /run/user/9000/pulse/native
image: ****/proxy:v1.0.156
imagePullPolicy: IfNotPresent
name: project-proxy
ports:
- containerPort: 8443
name: web
protocol: TCP
resources:
requests:
cpu: 150m
memory: 50M
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tmp
name: tmp
- mountPath: /run
name: run
- mountPath: /run/lock
name: run-lock
- mountPath: /etc/project/tls/server
name: tls
readOnly: true
- mountPath: /var/run/project
name: vnc-sock
- mountPath: /mnt/home
name: home
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-wnppn
readOnly: true
- env:
- name: USER
value: project
- name: UID
value: "9000"
- name: HOME
value: /home/project
- name: LANG
value: en_US.UTF-8
- name: LANGUAGE
value: en_US:en
- name: DISPLAY_SOCK_ADDR
value: /var/run/project/display.sock
- name: ENABLE_ROOT
value: "true"
image: ****/ubuntu-xfce4:v1.0.0
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- kill
- -s
- SIGRTMIN+3
- "1"
name: desktop
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
securityContext:
capabilities:
add:
- SYS_ADMIN
privileged: true
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /run
name: run
- mountPath: /run/lock
name: run-lock
- mountPath: /dev/shm
name: shm
- mountPath: /home/user
name: home
- mountPath: /tmp
name: tmp
- mountPath: /var/run/user
name: vnc-sock
- mountPath: /sys/fs/cgroup
name: cgroups
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-wnppn
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 9000
runAsNonRoot: false
serviceAccount: default
serviceAccountName: default
subdomain: ubuntu-fk5a8-91b4d208t9nxv
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
volumes:
- emptyDir: {}
name: run
- emptyDir: {}
name: run-lock
- hostPath:
path: /dev/shm
type: ""
name: shm
- name: tls
secret:
defaultMode: 420
secretName: ubuntu-fk5a8-91b4d208t9nxv
- emptyDir: {}
name: tmp
- emptyDir: {}
name: vnc-sock
- emptyDir: {}
name: home
- hostPath:
path: /sys/fs/cgroup
type: ""
name: cgroups
- name: kube-api-access-wnppn
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
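One thing worth double-checking in a spec like this: with the NVIDIA container stack, the driver's graphics libraries (including libEGL_nvidia.so.0) are only injected when the driver capabilities include graphics (or all). The image above already sets NVIDIA_DRIVER_CAPABILITIES=all at build time; a hedged, pod-level equivalent for the desktop container's env list, redundant if the image's ENV survives but explicit, would be:

```yaml
# Assumed addition to the desktop container's env list.
- name: NVIDIA_DRIVER_CAPABILITIES
  value: all
```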
As specified in the VirtualGL User's Guide, the -d argument should be a DRI device path (/dev/dri/card[0-9]) or an EGL device name (egl[0-9]). /dev/nvidia0 is neither, which is why VirtualGL reports "Invalid EGL device".
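Putting that together, the device argument can be validated before launching. A minimal sketch (the helper is hypothetical, but the accepted -d forms follow the User's Guide):

```python
import re

# Forms the VirtualGL User's Guide accepts for -d: a DRI device path
# (/dev/dri/card[0-9]) or an EGL device name (egl[0-9]).
VALID_DEVICE = re.compile(r"^(/dev/dri/card[0-9]+|egl[0-9]+)$")

def vglrun_cmd(device, app="glxgears"):
    """Build a vglrun command line, rejecting back-end devices that would
    trigger VirtualGL's "Invalid EGL device" error (e.g. /dev/nvidia0)."""
    if not VALID_DEVICE.match(device):
        raise ValueError(f"{device!r} is not a DRI path or EGL device name")
    return ["vglrun", "-d", device, app]

# The command from this issue, vglrun_cmd("/dev/nvidia0"), raises ValueError,
# while vglrun_cmd("/dev/dri/card0") builds a command VirtualGL will accept.
```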