siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.

Home Page:https://www.talos.dev

Nvidia system extensions not persisting after node reboot.

dsdole opened this issue · comments

Bug Report

Description

Nvidia system extensions are not persisting after a reboot. The nvidia driver and runtime work perfectly when the node is freshly provisioned. After a reboot, the extensions no longer show up in "talosctl get extensions -n talos-04" and the kernel modules are not loaded either.

Logs

Environment

  • 3 control plane nodes and 1 worker. All nodes are schedulable.
  • The worker has an Nvidia RTX 3070.
  • The nodes are proxmox VMs provisioned and bootstrapped by terraform.
  • The worker node has a single machineconfig patch "nvidia-gpu.yaml".
  • The install image specified includes the following system extensions:
    • siderolabs/nvidia-container-toolkit (535.129.03-v1.14.6)
    • siderolabs/nonfree-kmod-nvidia (535.129.03-v1.7.0)

nvidia-gpu.yaml:

machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
  install:
    image: factory.talos.dev/installer/0412a9a6369c0fb55e913cdfcbf4ad6ca3fab6e56ab71198ec4b58ad7e7a4ddd:v1.7.0
  • Talos version: [talosctl version --nodes <problematic nodes>]
Client:
        Tag:         v1.7.0
        SHA:         70fb41ff
        Built:       
        Go version:  go1.22.2
        OS/Arch:     linux/amd64
Server:
        NODE:        talos-04
        Tag:         v1.7.0
        SHA:         70fb41ff
        Built:       
        Go version:  go1.22.2
        OS/Arch:     linux/amd64
        Enabled:     RBAC

  • Kubernetes version: [kubectl version --short]
Client Version: v1.29.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
  • Platform:

Most probably you're not booting from disk, e.g. you're booting from an ISO. Or vice versa: you're booting off a disk image and the install never happened.

See boot assets guide, or use Image Factory wizard

Changing the boot order to boot from disk first fixed the issue. I just wanted to point out that this page of the [docs](https://www.talos.dev/v1.7/talos-guides/install/bare-metal-platforms/iso/) does state that the boot order should prefer the disk over the ISO, but it also says that if the machine boots from the ISO it will boot into an existing installation. Based on the docs, there is no reason to think that extensions would not load. Thanks for your help.
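
For anyone hitting the same symptom, here is a rough sketch of how the check after a reboot could be scripted. It assumes the node name talos-04 and the four module names from nvidia-gpu.yaml above. The helper `missing_modules` is a hypothetical name introduced here, not part of talosctl; it only filters a /proc/modules-style listing passed on stdin.

```shell
# Hypothetical helper (not part of talosctl): print every expected kernel
# module that is absent from a /proc/modules-style listing read on stdin.
# The first whitespace-separated field of each input line is the module name.
missing_modules() {
  loaded=$(awk '{print $1}')
  for mod in "$@"; do
    # -x: match the whole line, -F: treat the name as a fixed string
    printf '%s\n' "$loaded" | grep -qxF -- "$mod" || printf '%s\n' "$mod"
  done
}

# Usage after a reboot (node name assumed from this issue):
#   talosctl get extensions -n talos-04
#   talosctl read /proc/modules -n talos-04 \
#     | missing_modules nvidia nvidia_uvm nvidia_drm nvidia_modeset
```

If the helper prints nothing, all listed modules are loaded. If it prints all four names and the extensions list is empty as well, the node is probably not running the installed image, which is what happened here.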