microk8s 1.24 - enabling gpu addon fails

Question

ACodingfreak opened this issue 6 days ago

Summary

I was previously using MicroK8s 1.29, as mentioned in #4557.

I then downgraded to MicroK8s 1.24 by removing the existing installation and doing a clean install of 1.24.
Now, when I enable the GPU addon, I see the following error:

mm321:~$ microk8s.enable gpu
Infer repository core for addon gpu
Enabling NVIDIA GPU
Addon core/dns is already enabled
Enabling Helm 3
Fetching helm version v3.8.0.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.9M  100 12.9M    0     0  9671k      0  0:00:01  0:00:01 --:--:-- 9664k
Helm 3 is enabled
Checking if NVIDIA driver is already installed
Using operator GPU driver
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/5872/credentials/client.config
Error: repository name (nvidia) already exists, please specify a different name
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/5872/credentials/client.config
NAME: gpu-operator
LAST DEPLOYED: Mon Jul  1 12:01:20 2024
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
NVIDIA is enabled
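The "repository name (nvidia) already exists" line suggests that a Helm repository entry named nvidia was already registered when the addon ran. Whether that entry is related to the status mismatch is not established here. As a minimal sketch, the hypothetical helper below (helm_repo_exists is not part of MicroK8s or Helm) reads the tabular output of `helm repo list` on stdin and reports whether a given repository name is present. It assumes the usual layout of one header line followed by one row per repository, with the name in the first column.

```shell
# helm_repo_exists NAME
# Hypothetical helper: reads `helm repo list` output on stdin and
# returns 0 if a repository called NAME is listed, 1 otherwise.
# Assumes a header line followed by rows whose first column is the name.
helm_repo_exists() {
    awk -v name="$1" '
        NR > 1 && $1 == name { found = 1 }   # skip the header row
        END { exit !found }                  # exit 0 only when found
    '
}
```

It could be used as `microk8s helm3 repo list | helm_repo_exists nvidia && echo "nvidia repo already registered"`, assuming the helm3 addon is enabled as shown in the output above.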

The output says NVIDIA is enabled, but microk8s status still lists the gpu addon as disabled, as shown below:

mm321:~$ microk8s status
microk8s is running
high-availability: no
  datastore master nodes: 10.10.26.231:19001
  datastore standby nodes: none
addons:
  enabled:
    dashboard            # (core) The Kubernetes dashboard
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm3                # (core) Helm 3 - Kubernetes package manager
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    storage              # (core) Alias to hostpath-storage add-on, deprecated
  disabled:
    community            # (core) The community addons repository
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    helm                 # (core) Helm 2 - the package manager for Kubernetes
    host-access          # (core) Allow Pods connecting to Host services smoothly
    mayastor             # (core) OpenEBS MayaStor
    prometheus           # (core) Prometheus operator for monitoring and logging
    rbac                 # (core) Role-Based Access Control for authorisation
    registry             # (core) Private image registry exposed on localhost:32000
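To check a single addon without reading the whole listing, the status output above can be filtered. This is a rough sketch, assuming the layout shown above (an "enabled:" and a "disabled:" heading, each followed by one addon per line); addon_state is a hypothetical helper name, not a MicroK8s command.

```shell
# addon_state NAME
# Hypothetical helper: reads `microk8s status` output on stdin and prints
# "enabled", "disabled", or "unknown" for the addon called NAME.
addon_state() {
    awk -v name="$1" '
        /^[[:space:]]*enabled:/  { section = "enabled";  next }
        /^[[:space:]]*disabled:/ { section = "disabled"; next }
        # Inside a section, the first field of each line is the addon name.
        section != "" && $1 == name { print section; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}
```

For example, `microk8s status | addon_state gpu` would print disabled for the listing above.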

What Should Happen Instead?

Enabling the GPU addon should succeed, and microk8s status should list gpu as enabled.

Reproduction Steps

Install microk8s 1.24 on mm231 and gpu01.
Add gpu01 to the cluster with mm231.
Run microk8s enable gpu on mm231.

Introspection Report

inspection-report-20240701_121105.tar.gz

Can you suggest a fix?

No

Are you interested in contributing with a fix?

Not Sure