Mr-Linus / SCV

SCV is a distributed cluster GPU sniffer.

Error: failed to start container "scv"

IT-YUNMENGZE opened this issue

Command run:

[root@Master ~]# kubectl apply -f deploy.yaml
namespace/scv created
clusterrole.rbac.authorization.k8s.io/scv-cr created
serviceaccount/scv-sa created
clusterrolebinding.rbac.authorization.k8s.io/scv-crb created
daemonset.apps/scv-2 created

Checking on the Pods shows their status stuck at CrashLoopBackOff:

[root@Master ~]# kubectl get pods -o wide -n scv
NAME          READY   STATUS             RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
scv-2-mmc78   0/1     CrashLoopBackOff   4          2m53s   10.244.1.3   node1   <none>           <none>
scv-2-tvxw6   0/1     CrashLoopBackOff   4          2m53s   10.244.2.3   node2   <none>           <none>

Detailed event information for one of the Pods:

[root@Master ~]# kubectl describe pod scv-2-mmc78 -n scv
Name:         scv-2-mmc78
Namespace:    scv
Priority:     0
Node:         node1/192.168.108.129
Start Time:   Fri, 08 Oct 2021 17:03:04 +0800
Labels:       app=scv
              controller-revision-hash=6bb8c64d4f
              pod-template-generation=1
Annotations:  <none>
Status:       Running
IP:           10.244.1.3
IPs:
  IP:           10.244.1.3
Controlled By:  DaemonSet/scv-2
Containers:
  scv:
    Container ID:   docker://63674a568e806db75497f9559f8a8e6ad08104b8de68fa72e212298bd0ad8e50
    Image:          registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv:2.0
    Image ID:       docker-pullable://registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv@sha256:90cf73758ff07175d00953ec510ba4af5c96bb3b9c985c3dd55cbee079357329
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown
      Exit Code:    128
      Started:      Fri, 08 Oct 2021 17:06:06 +0800
      Finished:     Fri, 08 Oct 2021 17:06:06 +0800
    Ready:          False
    Restart Count:  5
    Limits:
      memory:  200Mi
    Requests:
      cpu:     100m
      memory:  200Mi
    Environment:
      NODE_NAME:                (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:  all
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from scv-sa-token-6zzls (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  scv-sa-token-6zzls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  scv-sa-token-6zzls
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m17s                  default-scheduler  Successfully assigned scv/scv-2-mmc78 to node1
  Normal   Pulled     4m16s                  kubelet            Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv:2.0" in 960.991432ms
  Normal   Pulled     4m14s                  kubelet            Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv:2.0" in 871.025969ms
  Normal   Pulled     3m59s                  kubelet            Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv:2.0" in 880.149293ms
  Warning  Failed     3m32s (x4 over 4m16s)  kubelet            Error: failed to start container "scv": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown
  Normal   Pulled     3m32s                  kubelet            Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv:2.0" in 886.933766ms
  Warning  BackOff    2m56s (x6 over 3m56s)  kubelet            Back-off restarting failed container
  Normal   Pulling    2m45s (x5 over 4m17s)  kubelet            Pulling image "registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv:2.0"
  Normal   Created    2m44s (x5 over 4m16s)  kubelet            Created container scv
  Normal   Pulled     2m44s                  kubelet            Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/njupt-isl/scv:2.0" in 909.780206ms

Error message:

Error: failed to start container "scv": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown
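
The useful part of that long OCI runtime message is the `nvidia-container-cli` fragment at the end. A minimal sketch for pulling it out of `kubectl describe` output, assuming a grep that supports `-o` (GNU or BSD); the function name `extract_nvidia_error` is made up for illustration and is not part of SCV or kubectl:

```shell
# Hypothetical helper: read text on stdin and print each distinct
# "nvidia-container-cli: ..." fragment it contains, one per line.
extract_nvidia_error() {
  grep -o 'nvidia-container-cli: .*' | sort -u
}

# Usage (needs cluster access, so it is left commented out):
# kubectl describe pod scv-2-mmc78 -n scv | extract_nvidia_error
```

Because the same message appears under both Last State and Events, `sort -u` collapses the repeats into a single line.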

Environment:

[root@Master ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:51:19Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

Resolved. The problem was with the virtual machine's graphics card.
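
For anyone who hits the same message: since the cause here was the VM's graphics card, one thing worth checking on each node is whether the VM sees an NVIDIA device at all. A rough sketch, assuming `lspci` (from pciutils) and `nvidia-smi` are installed on the node. `has_nvidia_device` is a hypothetical helper name, and the thread does not say which check actually exposed the problem:

```shell
# Hypothetical helper: succeed if the PCI listing on stdin mentions an NVIDIA device.
has_nvidia_device() {
  grep -qi 'nvidia'
}

# Usage on a node (left commented out because it depends on the node's hardware):
# lspci | has_nvidia_device && echo "NVIDIA device visible" || echo "no NVIDIA device visible"
# nvidia-smi   # should list the GPU if the driver can talk to it
```

If no NVIDIA device shows up inside the VM, the container runtime hook has nothing to initialize, regardless of how the DaemonSet is configured.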