ROCm / ROCm

AMD ROCm™ Software - GitHub Home

Home Page: https://rocm.docs.amd.com

[Feature]: CDI generator for a universal container support

Scapal opened this issue

Suggestion Description

To make it easier to expose AMD GPUs in containers, ROCm should embrace the Container Device Interface (CDI).

It should be quite straightforward to implement, as it mostly amounts to listing the devices in a YAML file and defining any hooks they need.

cdiVersion: 0.5.0
kind: amd.com/gpu
devices:
- name: "0"
  containerEdits:
    deviceNodes:
    - path: /dev/kfd
    - path: /dev/dri/renderD128
- name: "1"
  containerEdits:
    deviceNodes:
    - path: /dev/kfd
    - path: /dev/dri/renderD130
- name: "all"
  containerEdits:
    deviceNodes:
    - path: /dev/kfd
    - path: /dev/dri/renderD128
    - path: /dev/dri/renderD130
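
A generator for a spec like the one above could be sketched as follows. This is only a minimal illustration, not an existing ROCm tool: the function name `build_cdi_spec` is hypothetical, the caller is assumed to already know which render nodes belong to AMD GPUs, and the output is JSON rather than YAML to avoid a third-party dependency.

```python
import json
import re


def build_cdi_spec(render_nodes, kind="amd.com/gpu", cdi_version="0.5.0"):
    """Build a CDI spec dict: one device per render node, plus an "all" device.

    Hypothetical helper for illustration; it does not probe the host.
    """
    def edits(nodes):
        # Each device entry exposes /dev/kfd alongside its render node(s),
        # mirroring the hand-written spec in the issue.
        return {"deviceNodes": [{"path": "/dev/kfd"}] + [{"path": n} for n in nodes]}

    # Sort by numeric suffix so renderD128 is device "0", renderD130 is "1", etc.
    ordered = sorted(render_nodes, key=lambda p: int(re.search(r"(\d+)$", p).group(1)))

    devices = [
        {"name": str(index), "containerEdits": edits([node])}
        for index, node in enumerate(ordered)
    ]
    devices.append({"name": "all", "containerEdits": edits(ordered)})
    return {"cdiVersion": cdi_version, "kind": kind, "devices": devices}


if __name__ == "__main__":
    spec = build_cdi_spec(["/dev/dri/renderD128", "/dev/dri/renderD130"])
    print(json.dumps(spec, indent=2))
```

A real generator would also need to check that each render node is actually driven by amdgpu before listing it.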

CDI is now supported by Docker (v25), containerd, CRI-O and Podman.

For example, with Docker or Podman, you can then specify --device amd.com/gpu=1 instead of --device /dev/kfd --device /dev/dri/renderD130.
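
To illustrate what a runtime does with such a request, here is a rough sketch of resolving a device name against the spec. It assumes the CDI fully qualified form `vendor/class=name` (for example `amd.com/gpu=1`); the helper names `parse_qualified_name` and `resolve_device_nodes` are hypothetical and do not correspond to any runtime's actual API.

```python
def parse_qualified_name(name):
    """Split 'vendor/class=device' into (kind, device)."""
    kind, sep, device = name.partition("=")
    if not sep or "/" not in kind or not device:
        raise ValueError(f"not a CDI qualified device name: {name!r}")
    return kind, device


def resolve_device_nodes(spec, name):
    """Return the device node paths that the spec grants for a qualified name."""
    kind, device = parse_qualified_name(name)
    if kind != spec["kind"]:
        raise KeyError(f"spec is for {spec['kind']}, not {kind}")
    for entry in spec["devices"]:
        if entry["name"] == device:
            return [node["path"] for node in entry["containerEdits"]["deviceNodes"]]
    raise KeyError(f"no device named {device!r} in {kind}")


# The two-GPU spec from the issue, reduced to the entries used here.
SPEC = {
    "cdiVersion": "0.5.0",
    "kind": "amd.com/gpu",
    "devices": [
        {"name": "0", "containerEdits": {"deviceNodes": [
            {"path": "/dev/kfd"}, {"path": "/dev/dri/renderD128"}]}},
        {"name": "1", "containerEdits": {"deviceNodes": [
            {"path": "/dev/kfd"}, {"path": "/dev/dri/renderD130"}]}},
    ],
}

if __name__ == "__main__":
    print(resolve_device_nodes(SPEC, "amd.com/gpu=1"))
```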

Operating System

Linux

GPU

No response

ROCm Component

No response

Hello @Scapal! I'm currently working on integrating CDI for LXD (https://github.com/canonical/lxd/tree/main) and would be very interested in having a CDI generator for AMD GPUs! Also, as nvidia-ctk (https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/cmd/nvidia-ctk/README.md) does, the generator should discover the runtime libraries on the host that need to be passed into the container namespace and list them in the CDI spec (with their associated hooks). Here is an example for an NVIDIA GPU:

{
  "cdiVersion": "0.5.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [
          {
            "path": "/dev/nvidia0"
          },
          {
            "path": "/dev/dri/card1"
          },
          {
            "path": "/dev/dri/renderD128"
          }
        ],
        "hooks": [
          {
            "hookName": "createContainer",
            "path": "/usr/bin/nvidia-ctk",
            "args": [
              "nvidia-ctk",
              "hook",
              "chmod",
              "--mode",
              "755",
              "--path",
              "/dev/dri"
            ]
          }
        ]
      }
    },
    {
      "name": "GPU-8836a272-ca08-9629-24b6-54ca9b98cd30",
      "containerEdits": {
        "deviceNodes": [
          {
            "path": "/dev/nvidia0"
          },
          {
            "path": "/dev/dri/card1"
          },
          {
            "path": "/dev/dri/renderD128"
          }
        ],
        "hooks": [
          {
            "hookName": "createContainer",
            "path": "/usr/bin/nvidia-ctk",
            "args": [
              "nvidia-ctk",
              "hook",
              "chmod",
              "--mode",
              "755",
              "--path",
              "/dev/dri"
            ]
          }
        ]
      }
    },
    {
      "name": "all",
      "containerEdits": {
        "deviceNodes": [
          {
            "path": "/dev/nvidia0"
          },
          {
            "path": "/dev/dri/card1"
          },
          {
            "path": "/dev/dri/renderD128"
          }
        ],
        "hooks": [
          {
            "hookName": "createContainer",
            "path": "/usr/bin/nvidia-ctk",
            "args": [
              "nvidia-ctk",
              "hook",
              "chmod",
              "--mode",
              "755",
              "--path",
              "/dev/dri"
            ]
          }
        ]
      }
    }
  ],
  "containerEdits": {
    "env": [
      "NVIDIA_VISIBLE_DEVICES=void"
    ],
    "deviceNodes": [
      {
        "path": "/dev/nvidia-modeset"
      },
      {
        "path": "/dev/nvidia-uvm"
      },
      {
        "path": "/dev/nvidia-uvm-tools"
      },
      {
        "path": "/dev/nvidiactl"
      }
    ],
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "create-symlinks",
          "--link",
          "libglxserver_nvidia.so.535.161.08::/usr/lib/aarch64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so"
        ]
      },
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "update-ldcache",
          "--folder",
          "/usr/lib/aarch64-linux-gnu"
        ]
      }
    ],
    "mounts": [
      {
        "hostPath": "/run/nvidia-persistenced/socket",
        "containerPath": "/run/nvidia-persistenced/socket",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind",
          "noexec"
        ]
      },
      {
        "hostPath": "/usr/bin/nvidia-cuda-mps-control",
        "containerPath": "/usr/bin/nvidia-cuda-mps-control",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/bin/nvidia-cuda-mps-server",
        "containerPath": "/usr/bin/nvidia-cuda-mps-server",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/bin/nvidia-debugdump",
        "containerPath": "/usr/bin/nvidia-debugdump",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/bin/nvidia-persistenced",
        "containerPath": "/usr/bin/nvidia-persistenced",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/bin/nvidia-smi",
        "containerPath": "/usr/bin/nvidia-smi",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libEGL_nvidia.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libEGL_nvidia.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libGLESv1_CM_nvidia.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libGLESv1_CM_nvidia.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libGLESv2_nvidia.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libGLESv2_nvidia.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libGLX_nvidia.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libGLX_nvidia.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libcuda.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libcuda.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libcudadebugger.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libcudadebugger.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvcuvid.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvcuvid.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-allocator.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-allocator.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-cfg.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-cfg.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-egl-gbm.so.1.1.0",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-egl-gbm.so.1.1.0",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-eglcore.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-eglcore.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-encode.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-encode.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-fbc.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-fbc.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-glcore.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-glcore.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-glsi.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-glsi.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-glvkspirv.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-glvkspirv.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-ml.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-ngx.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-ngx.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-nvvm.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-nvvm.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-opencl.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-opencl.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-opticalflow.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-opticalflow.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-ptxjitcompiler.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-ptxjitcompiler.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-rtcore.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-rtcore.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-tls.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-tls.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvidia-vulkan-producer.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvidia-vulkan-producer.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/libnvoptix.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/libnvoptix.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/share/nvidia/nvoptix.bin",
        "containerPath": "/usr/share/nvidia/nvoptix.bin",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/lib/firmware/nvidia/535.161.08/gsp_ga10x.bin",
        "containerPath": "/lib/firmware/nvidia/535.161.08/gsp_ga10x.bin",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/lib/firmware/nvidia/535.161.08/gsp_tu10x.bin",
        "containerPath": "/lib/firmware/nvidia/535.161.08/gsp_tu10x.bin",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/share/X11/xorg.conf.d/10-nvidia.conf",
        "containerPath": "/usr/share/X11/xorg.conf.d/10-nvidia.conf",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json",
        "containerPath": "/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
        "containerPath": "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/share/vulkan/icd.d/nvidia_icd.json",
        "containerPath": "/usr/share/vulkan/icd.d/nvidia_icd.json",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/share/vulkan/implicit_layer.d/nvidia_layers.json",
        "containerPath": "/usr/share/vulkan/implicit_layer.d/nvidia_layers.json",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.535.161.08",
        "containerPath": "/usr/lib/aarch64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.535.161.08",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      },
      {
        "hostPath": "/usr/lib/aarch64-linux-gnu/nvidia/xorg/nvidia_drv.so",
        "containerPath": "/usr/lib/aarch64-linux-gnu/nvidia/xorg/nvidia_drv.so",
        "options": [
          "ro",
          "nosuid",
          "nodev",
          "bind"
        ]
      }
    ]
  }
}
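
The "mounts" section above follows a regular pattern: each discovered host file is bind-mounted read-only at the same path in the container. A rough sketch of producing such entries is shown below. The helper `mounts_for` is hypothetical, and the ROCm library paths in the usage example are illustrative assumptions only, since the actual files and locations vary by ROCm release and distribution.

```python
import json
import os

# Mount options used by most entries in the NVIDIA example above.
DEFAULT_OPTIONS = ["ro", "nosuid", "nodev", "bind"]


def mounts_for(host_paths, exists=os.path.exists):
    """Build CDI mount entries for the candidate host files that are present.

    `exists` is injectable so the function can be exercised without
    touching the real filesystem.
    """
    mounts = []
    for path in host_paths:
        if not exists(path):
            continue  # skip libraries that are not installed on this host
        mounts.append({
            "hostPath": path,
            "containerPath": path,  # same location inside the container
            "options": list(DEFAULT_OPTIONS),
        })
    return mounts


if __name__ == "__main__":
    # Assumed candidate paths, for illustration only.
    candidates = [
        "/opt/rocm/lib/libamdhip64.so",
        "/opt/rocm/lib/libhsa-runtime64.so",
    ]
    print(json.dumps({"mounts": mounts_for(candidates)}, indent=2))
```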

What would be even more amazing is filtering capabilities. NVIDIA does this in their nvidia-container-runtime through environment variables, to pass through only certain runtime libraries, for example only the libraries related to compute or display. See here for more info. Note that they do not yet support this kind of filtering in CDI generation, but maybe they will at some point.
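
As a sketch of what such filtering could look like at generation time, the snippet below keeps only the mounts whose file names match the requested capabilities. Everything here is an assumption made for illustration: the function `filter_mounts`, the capability names, and the grouping of libraries into "compute" and "display" are hypothetical and do not reflect how NVIDIA's tooling classifies them.

```python
import fnmatch
import os

# Hypothetical grouping of library file names by capability.
CAPABILITY_PATTERNS = {
    "compute": ["libcuda.so.*", "libnvidia-opencl.so.*", "libnvidia-ptxjitcompiler.so.*"],
    "display": ["libEGL_nvidia.so.*", "libGLX_nvidia.so.*", "libnvidia-glcore.so.*"],
}


def filter_mounts(mounts, capabilities):
    """Keep only mounts whose host file name matches a requested capability."""
    patterns = [p for cap in capabilities for p in CAPABILITY_PATTERNS[cap]]
    return [
        mount for mount in mounts
        if any(fnmatch.fnmatchcase(os.path.basename(mount["hostPath"]), p) for p in patterns)
    ]


if __name__ == "__main__":
    lib_dir = "/usr/lib/aarch64-linux-gnu"
    mounts = [
        {"hostPath": f"{lib_dir}/libcuda.so.535.161.08"},
        {"hostPath": f"{lib_dir}/libEGL_nvidia.so.535.161.08"},
    ]
    for mount in filter_mounts(mounts, ["compute"]):
        print(mount["hostPath"])
```

A generator for AMD GPUs could apply the same idea with its own capability groups for the ROCm and graphics libraries.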

@Scapal, speaking as a maintainer of the CDI specification (as well as of the NVIDIA tooling that generates CDI specifications for our devices), it would definitely be beneficial to be able to generate CDI specifications for ROCm devices.

As @gabrielmougard mentions, they are currently looking into what is required to add CDI support to LXC/LXD. This would add to the existing support in OCI-compliant runtimes such as Docker, Podman, containerd, and CRI-O.