Scheduling fails on a node with 3 GPUs

Question

scxu opened this issue 5 years ago · comments

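I am trying to implement the approach described in this paper: https://ieeexplore.ieee.org/abstract/document/8672318. After a successful deployment, scheduling works fine on a node with four GPUs, but on another machine with only three GPUs the pods get stuck in Pending. In other words, the resources are clearly available, yet the scheduler reports: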

Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/4 nodes are available: 1 Insufficient tencent.com/vcuda-core, 3 node(s) didn't match node selector.
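For readers unfamiliar with this message format, here is a minimal sketch of how the FailedScheduling text can be broken down per reason. The helper name parse_failed_scheduling is hypothetical and is not part of kubectl or any Kubernetes client; it only assumes the message has the shape shown in the event above.

```python
import re


def parse_failed_scheduling(message: str) -> dict:
    """Split a FailedScheduling message into per-reason node counts.

    Hypothetical helper: it assumes the "<a>/<b> nodes are available: <n> <reason>, ..."
    shape seen in the event above.
    """
    header, _, details = message.partition(": ")
    match = re.match(r"(\d+)/(\d+) nodes are available", header)
    if match is None:
        raise ValueError("unrecognized message: " + message)
    reasons = {}
    # Each comma-separated part starts with the number of nodes rejected for that reason.
    for part in details.rstrip(".").split(", "):
        count, _, reason = part.partition(" ")
        reasons[reason] = int(count)
    return {
        "available": int(match.group(1)),
        "total": int(match.group(2)),
        "reasons": reasons,
    }


msg = (
    "0/4 nodes are available: 1 Insufficient tencent.com/vcuda-core, "
    "3 node(s) didn't match node selector."
)
result = parse_failed_scheduling(msg)
for reason, count in result["reasons"].items():
    print(count, "node(s):", reason)
```

Read this way, three of the four nodes were excluded by the node selector, so the only candidate was the remaining node, and that node was rejected for insufficient tencent.com/vcuda-core.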

Here is the output of kubectl describe node:

It shows that only two of the GPUs have been allocated.

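The node has three GPUs because one card was faulty and has been masked out. The scheduling failure can be worked around to some extent by force-restarting kube-scheduler: after a restart, scheduling usually works for a while, but the same problem comes back later.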

Running nvidia-smi topo -mp on the three-GPU machine gives:

	GPU0	GPU1	GPU2	CPU Affinity
GPU0	 X 	SYS	SYS	0-11
GPU1	SYS	 X 	PIX	0-11
GPU2	SYS	PIX	 X 	0-11

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
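To make the matrix above easier to compare across machines, here is a minimal sketch that turns it into a map of GPU pairs to link types. The function name parse_topo is hypothetical, and the parser only assumes the tab-separated layout shown above, with GPU rows listed in the same order as the GPU columns.

```python
def parse_topo(text: str) -> dict:
    """Parse the matrix part of `nvidia-smi topo -mp` into {(gpu_a, gpu_b): link}.

    Hypothetical helper: assumes GPU rows appear in the same order as the GPU columns.
    """
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    # Keep only the GPU columns of the header; "CPU Affinity" is ignored.
    gpus = [name for name in rows[0] if name.startswith("GPU")]
    gpu_rows = [row for row in rows[1:] if row and row[0].startswith("GPU")]
    links = {}
    for i, row in enumerate(gpu_rows):
        # Only the upper triangle is needed because the matrix is symmetric.
        for j in range(i + 1, len(gpus)):
            links[(gpus[i], gpus[j])] = row[1 + j]
    return links


# The matrix reported in this issue.
TOPO = """\tGPU0\tGPU1\tGPU2\tCPU Affinity
GPU0\t X \tSYS\tSYS\t0-11
GPU1\tSYS\t X \tPIX\t0-11
GPU2\tSYS\tPIX\t X \t0-11
"""

links = parse_topo(TOPO)
for pair, link in links.items():
    print(pair, link)
```

On this machine, GPU1 and GPU2 share a PIX link, while GPU0 reaches the other two cards only over SYS.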

k8s version: 1.17.3
nvidia driver version: 440.59

jasonxie · Answer 1 · Wed Jun 24 2020 15:31:14 GMT+0800 (China Standard Time)

Maybe the same as #5.