mattshma / bigdata

hadoop,hbase,storm,spark,etc..

nvidia-smi fails with "Failed to initialize NVML: Unknown Error"

mattshma opened this issue · comments

Running nvidia-smi inside a container fails with the following error:

# nvidia-smi
Failed to initialize NVML: Unknown Error

Tracing the command with strace shows the following errors:

open("/dev/nvidiactl", O_RDWR)          = -1 EPERM (Operation not permitted)
open("/dev/nvidiactl", O_RDONLY)        = -1 EPERM (Operation not permitted)
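
When strace is not available in the container image, the same check can be approximated with a few lines of Python. This is only a minimal sketch: the helper name probe_open is hypothetical, and it does nothing more than attempt the two open() calls seen in the trace and report the symbolic errno.

```python
import errno
import os


def probe_open(path, flags=os.O_RDWR):
    """Try to open `path`; return "ok" or the symbolic errno name."""
    try:
        fd = os.open(path, flags)
    except OSError as exc:
        # Map the numeric errno to its name, e.g. EPERM or ENOENT.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    # Same device node and the same two open modes as in the strace output.
    for flags, name in ((os.O_RDWR, "O_RDWR"), (os.O_RDONLY, "O_RDONLY")):
        print(name, probe_open("/dev/nvidiactl", flags))
```

In a container hitting this problem, both calls would be expected to report EPERM, matching the strace output above. ENOENT would instead suggest that the device node was never created in the container.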

After some searching, issues-77073 and issues-966 both mention this problem. Checking the container's device cgroup whitelist:

$ cat /sys/fs/cgroup/devices/devices.list
c 1:5 rwm
c 1:3 rwm
c 1:9 rwm
c 1:8 rwm
c 5:0 rwm
c 5:1 rwm
c *:* m
b *:* m
c 1:7 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
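
To make the whitelist easier to read, the sketch below checks whether a given device is allowed by a cgroup v1 devices.list. It is a simplified model, not the kernel's implementation: the function names parse_rules and is_allowed are hypothetical, and it assumes every listed line is an allow rule that must cover all requested access flags on its own. /dev/nvidiactl is character device 195:255.

```python
DEVICES_LIST = """\
c 1:5 rwm
c 1:3 rwm
c 1:9 rwm
c 1:8 rwm
c 5:0 rwm
c 5:1 rwm
c *:* m
b *:* m
c 1:7 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
"""


def parse_rules(lines):
    """Parse 'type major:minor access' lines into tuples."""
    rules = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        dev_type, numbers, access = parts
        major, _, minor = numbers.partition(":")
        rules.append((dev_type, major, minor, access))
    return rules


def is_allowed(rules, dev_type, major, minor, wanted="rw"):
    """Return True if a single rule grants every flag in `wanted`."""
    for r_type, r_major, r_minor, r_access in rules:
        if r_type not in ("a", dev_type):
            continue
        if r_major not in ("*", str(major)):
            continue
        if r_minor not in ("*", str(minor)):
            continue
        if all(flag in r_access for flag in wanted):
            return True
    return False


RULES = parse_rules(DEVICES_LIST.splitlines())
print("nvidiactl rw:", is_allowed(RULES, "c", 195, 255))
print("/dev/null rw:", is_allowed(RULES, "c", 1, 3))
```

With the list from this issue, the only rule that matches 195:255 is the catch-all "c *:* m", which grants mknod but not read or write. That is consistent with open() returning EPERM.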

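The list has no read/write entry for the NVIDIA character devices (major 195; /dev/nvidiactl is 195:255), only the catch-all "c *:* m", which explains the EPERM. Checking the Kubernetes configuration, the kubelet is started with:

--cpu-manager-policy=static --feature-gates=CPUManager=true,SupportPodPidsLimit=true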

Removing this setting, or changing it to a different value, resolves the problem. A better approach is to modify libnvidia-container.

How should libnvidia-container be modified?