pulsepointinc / NVIDIA-driver-container

NVIDIA driver container for Rocky Linux (and maybe AlmaLinux and Oracle Linux)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

NVIDIA GPU Operator driver container for Rocky Linux (and maybe AlmaLinux and Oracle Linux) 8

NVIDIA currently does not support Rocky Linux (or AlmaLinux, or other modern Enterprise Linux clones) in the GPU operator for Kubernetes, making it a challenge to get the NVIDIA operator stood up on a cluster that is running on one of these EL clones. This container image aims to helm with that (though this is only tested with Rocky).

A few changes are necessary to make the RHEL 8 container image build on non-RHEL systems:

  • The OS-sniffing logic here needs to be updated to include identifiers for rocky, almalinux, and ol
  • The target kernel and supporting dependencies necessary to build the NVIDIA kernel modules must be installed into the container and the bootstrapping logic for dependencies here needs to be modified
  • The container image must be installed in a container registry with a tag specific to the target OS + release version for the cluster

Prerequisites

  • A container registry you are already authenticated with that you can publish to
  • A host running the same OS and kernel version as your GPU Kubernetes hosts, to run this build on

Building

To build on Rocky 8, you will just need to override the CONTAINER_REGISTRY env var to point to the registry of your choice.

On AlmaLinux or Oracle Linux, you will also need to update the RPM_BASE_URL env var to point to the BaseOS RPM repo for your OS + architecture.

If you wish to build a different driver version than 535.104.12, override the NVIDIA_DRIVER_VERSION env var as well.

Running ./build.sh after exporting any env vars should build and publish the container to $CONTAINER_REGISTRY/nvidia/driver:$NVIDIA_DRIVER_VERSION-${OS_NAME}${OS_RELEASE}

Deploying

Deploy the operator helm chart with the values for CONTAINER_REGISTRY and NVIDIA_DRIVER_VERSION - as an example:

export CONTAINER_REGISTRY=container-registry.siomporas.com
export NVIDIA_DRIVER_VERSION=535.104.12
helm install --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.repository=$CONTAINER_REGISTRY/nvidia \
     --set driver.version=$NVIDIA_DRIVER_VERSION

Inspired by this (which no longer works).

About

NVIDIA driver container for Rocky Linux (and maybe AlmaLinux and Oracle Linux)


Languages

Language:Shell 98.9%Language:Dockerfile 1.1%