utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for prometheus using nvidia-smi binary

pod running error

bilbilmyc opened this issue

Describe the bug

I installed nvidia_gpu_exporter with Helm, changing only values.yml. The pods are running normally, but the exporter keeps logging errors: error="command failed. stderr: err: exit status 12"

To Reproduce
Steps to reproduce the behavior:
https://artifacthub.io/packages/helm/utkuozdemir/nvidia-gpu-exporter

  1. This is my values.yml
image:
  repository: docker.io/utkuozdemir/nvidia_gpu_exporter
  pullPolicy: IfNotPresent
  tag: ""

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations: {}

podSecurityContext: {}

securityContext:
  privileged: true

service:
  type: NodePort
  port: 9835
  nodePort: 30235

ingress:
  enabled: false
  className: ""
  annotations: {}
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []

resources: {}

nodeSelector: {}

tolerations: {}

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: NotIn
          values:
          - pt01
          - pt02
          - pt03

port: 9835

hostPort:
  enabled: false
  port: 9835

log:
  level: info
  format: logfmt

queryFieldNames:
  - AUTO

nvidiaSmiCommand: nvidia-smi

telemetryPath: /metrics

volumes:
  - name: nvidiactl
    hostPath:
      path: /dev/nvidiactl
  - name: nvidia0
    hostPath:
      path: /dev/nvidia0
  - name: nvidia-smi
    hostPath:
      path: /usr/bin/nvidia-smi
  - name: libnvidia-ml-so
    hostPath:
      path: /usr/lib/libnvidia-ml.so
  - name: libnvidia-ml-so-1
    hostPath:
      path: /usr/lib/libnvidia-ml.so.1

volumeMounts:
  - name: nvidiactl
    mountPath: /dev/nvidiactl
  - name: nvidia0
    mountPath: /dev/nvidia0
  - name: nvidia-smi
    mountPath: /usr/bin/nvidia-smi
  - name: libnvidia-ml-so
    mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
  - name: libnvidia-ml-so-1
    mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

serviceMonitor:
  enabled: false
  additionalLabels: {}
  scheme: http
  bearerTokenFile:
  interval:
  tlsConfig: {}
  proxyUrl: ""
  relabelings: []
  metricRelabelings: []
  scrapeTimeout: 10s
  2. This is the driver layout on the host (a cross-check sketch follows after this list):
[root@g105 ~]# ll /usr/lib/libnvidia-ml.so.1 /usr/lib/libnvidia-ml.so /dev/nvidiactl /dev/nvidia0 /usr/bin/nvidia-smi
crw-rw-rw- 1 root root 195,   0 Jan 18 17:34 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan 18 17:34 /dev/nvidiactl
-rwxr-xr-x 1 root root   634504 Jan 17 13:25 /usr/bin/nvidia-smi
lrwxrwxrwx 1 root root       17 Jan 17 13:25 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root       25 Jan 17 13:25 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.530.30.02
  3. This is the pod status:
$ kubectl get pod -n mayunchao
NAME                                     READY   STATUS    RESTARTS   AGE
gpu-exporter-nvidia-gpu-exporter-2lh74   1/1     Running   0          8m7s
gpu-exporter-nvidia-gpu-exporter-474rj   1/1     Running   0          8m44s
gpu-exporter-nvidia-gpu-exporter-6sdxd   1/1     Running   0          8m39s
gpu-exporter-nvidia-gpu-exporter-9xssr   1/1     Running   0          7m40s
gpu-exporter-nvidia-gpu-exporter-b5cpq   1/1     Running   0          6m56s
gpu-exporter-nvidia-gpu-exporter-brrlx   1/1     Running   0          7m30s
gpu-exporter-nvidia-gpu-exporter-dv4z7   1/1     Running   0          7m15s
gpu-exporter-nvidia-gpu-exporter-fcbbn   1/1     Running   0          6m39s
gpu-exporter-nvidia-gpu-exporter-g8gwq   1/1     Running   0          8m27s
gpu-exporter-nvidia-gpu-exporter-grbrt   1/1     Running   0          7m1s
gpu-exporter-nvidia-gpu-exporter-ms5dn   1/1     Running   0          6m49s
gpu-exporter-nvidia-gpu-exporter-pjfpj   1/1     Running   0          8m20s
gpu-exporter-nvidia-gpu-exporter-qzqg6   1/1     Running   0          7m52s
gpu-exporter-nvidia-gpu-exporter-z6sxz   1/1     Running   0          9m7s
gpu-exporter-nvidia-gpu-exporter-zt82b   1/1     Running   0          8m58s
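
For completeness, the hostPath entries in the values.yml above can be cross-checked against the host's actual driver layout. A minimal sketch, assuming shell access to one of the GPU nodes (paths taken from the driver listing in item 2):

# on the GPU host
$ ls -l /usr/lib*/libnvidia-ml.so*                      # where the driver installed NVML and what the symlinks resolve to
$ ldconfig -p | grep libnvidia-ml                       # whether the dynamic loader can find the library
$ nvidia-smi --query-gpu=name --format=csv,noheader     # whether nvidia-smi itself works on the host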

Expected behavior

I expect the pods to run properly and collect GPU metrics.

Console output

$ kubectl logs -n mayunchao gpu-exporter-nvidia-gpu-exporter-6sdxd

level=warn ts=2024-03-07T08:49:46.506Z caller=exporter.go:101 msg="Failed to auto-determine query field names, falling back to the built-in list"
level=info ts=2024-03-07T08:49:46.509Z caller=main.go:65 msg="Listening on address" address=:9835
level=info ts=2024-03-07T08:49:46.510Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
level=error ts=2024-03-07T08:49:50.685Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:04.185Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:05.663Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:19.164Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:20.663Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:34.163Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:35.663Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:49.163Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:50:50.663Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:04.163Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:05.662Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:19.164Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:20.668Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:34.164Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:35.662Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:49.164Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:51:50.663Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:52:04.164Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
level=error ts=2024-03-07T08:52:05.663Z caller=exporter.go:148 error="command failed. stderr:  err: exit status 12"
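
nvidia-smi documents return code 12 as "NVML Shared Library couldn't be found or loaded", which matches the library error shown under "Additional context" below. The failing call can be reproduced by hand (a sketch; pod name and namespace taken from the listing above):

$ kubectl exec -n mayunchao gpu-exporter-nvidia-gpu-exporter-6sdxd -- sh -c 'nvidia-smi; echo "exit code: $?"'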

Model and Version

  • GPU Model [e.g. NVIDIA GeForce RTX 4090]
  • App version and architecture [e.g. appVersion: 0.3.0, helm chart]
  • Installation method [e.g. helm]
  • Operating System [e.g. CentOS Linux release 7.9.2009 (Core)]
  • Nvidia GPU driver version [e.g. NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1]
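
If useful, the exact host-side GPU model and driver version can be captured in one line (a sketch, run on a GPU node):

$ nvidia-smi --query-gpu=name,driver_version --format=csv,noheader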

Additional context

I entered the pod and executed the following commands:

kubectl exec -it gpu-exporter-nvidia-gpu-exporter-2lh74 -n mayunchao bash
root@gpu-exporter-nvidia-gpu-exporter-2lh74:/# ll /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
-rwxr-xr-x 1 root root 1784524 Nov  4 01:14 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
root@gpu-exporter-nvidia-gpu-exporter-2lh74:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
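
Since the mounted library is visible at /usr/lib/x86_64-linux-gnu/libnvidia-ml.so but nvidia-smi still cannot load NVML, one possible next debugging step is to check how the loader resolves the library inside the container and whether an explicit search path changes the result. A minimal sketch, run inside the same pod (paths taken from the volumeMounts above):

# inside the exporter pod
$ ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*        # is libnvidia-ml.so.1 present, and is it a real file or a dangling symlink?
$ ldconfig -p | grep libnvidia-ml                          # does the container's loader cache know about the library?
$ LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu nvidia-smi     # does an explicit library search path make nvidia-smi work?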