nvidia-smi fails with "Failed to initialize NVML: Unknown Error"
mattshma opened this issue · comments
mattshma commented
Running nvidia-smi inside a container fails with the following error:
# nvidia-smi
Failed to initialize NVML: Unknown Error
Tracing the command with strace shows the following errors:
open("/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)
open("/dev/nvidiactl", O_RDONLY) = -1 EPERM (Operation not permitted)
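For anyone reproducing this, a minimal sketch of how the strace output could be narrowed down to the relevant lines. The helper name `nvidia_eperm` is made up for illustration, and the example invocation assumes strace is installed in the container.

```shell
# Hypothetical helper: read strace output on stdin and print only the
# calls on /dev/nvidia* device nodes that were rejected with EPERM.
nvidia_eperm() {
    grep -E '"/dev/nvidia[^"]*".*= -1 EPERM'
}

# Example invocation (not run here; assumes strace is available):
#   strace -f -e trace=file nvidia-smi 2>&1 | nvidia_eperm
```

If the filter prints nothing, the failure probably has a different cause than the device permission problem described here.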
After some searching, I found that issues-77073 and issues-966 mention this problem. I then looked at the device cgroup of the container:
$ cat /sys/fs/cgroup/devices/devices.list
c 1:5 rwm
c 1:3 rwm
c 1:9 rwm
c 1:8 rwm
c 5:0 rwm
c 5:1 rwm
c *:* m
b *:* m
c 1:7 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
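The listing can also be checked mechanically. The sketch below is one possible way to test whether a given device is covered by an entry in devices.list; `device_allowed` is a hypothetical helper, and the numbers 195:255 for /dev/nvidiactl are the usual values, which can be confirmed with `ls -l /dev/nvidiactl`.

```shell
# device_allowed LIST_FILE TYPE MAJOR MINOR ACCESS
# Hypothetical helper: exits 0 if some entry in LIST_FILE grants every
# access character in ACCESS (for example "rw") to device TYPE MAJOR:MINOR.
# Entries look like "c 195:255 rwm"; "*" is a wildcard, type "a" means all.
device_allowed() {
    awk -v t="$2" -v maj="$3" -v mnr="$4" -v acc="$5" '
        {
            split($2, dev, ":")
            if ($1 != "a" && $1 != t) next            # wrong device type
            if (dev[1] != "*" && dev[1] != maj) next  # wrong major number
            if (dev[2] != "*" && dev[2] != mnr) next  # wrong minor number
            ok = 1
            for (i = 1; i <= length(acc); i++)
                if (index($3, substr(acc, i, 1)) == 0) ok = 0
            if (ok) found = 1
        }
        END { exit !found }
    ' "$1"
}

# Example invocation inside the container (cgroup v1 path as in the listing):
#   device_allowed /sys/fs/cgroup/devices/devices.list c 195 255 rw \
#       && echo "allowed" || echo "denied"
```

With the listing shown above, only the `c *:* m` entry matches 195:255, so mknod is permitted but read and write access is not.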
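The allowlist has no rwm entry for the NVIDIA character devices (major 195), which is consistent with the EPERM errors above. Checking the Kubernetes configuration, the kubelet is started with --cpu-manager-policy=static and --feature-gates=CPUManager=true,SupportPodPidsLimit=true.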
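To check whether a node is running with the same setting, something like the following could be used. This is only a sketch: `flag_value` is a hypothetical helper, it only handles the `--name=value` form, and it assumes the kubelet is configured through command-line flags rather than a configuration file (where the equivalent field is `cpuManagerPolicy`).

```shell
# flag_value NAME
# Hypothetical helper: read a command line on stdin and print the value
# of --NAME=VALUE if that flag is present.
flag_value() {
    tr ' ' '\n' | sed -n "s/^--$1=//p"
}

# Example invocation on the node (not run here):
#   ps -o args= -C kubelet | flag_value cpu-manager-policy
```

An output of `static` means the node uses the same CPU manager policy as in this report.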
mattshma commented
Removing that setting, or changing it to a different value, resolves the problem. A better approach is to modify libnvidia-container.
WeiJie Li commented
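> Removing that setting, or changing it to a different value, resolves the problem. A better approach is to modify libnvidia-container.
How should libnvidia-container be modified?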