microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io

Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod

HaoLiuHust opened this issue

Organization Name:

Short summary about the issue/question:
Some nodes raise "Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod", and the job keeps waiting.
Brief what process you are following:

How to reproduce it:

OpenPAI Environment:

  • OpenPAI version: v1.8.0
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:
When adding the node, I used the default Docker config (if I use a daemon.json, Docker fails to restart). After adding the node, I changed /etc/docker/daemon to use my own config. I wonder if this is the reason.
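
One generic reason Docker can fail to restart with a custom daemon.json is that the file is not valid JSON. That is only a guess here, not a confirmed cause of this issue. A minimal Python sketch for checking the file before restarting Docker follows, assuming the standard location /etc/docker/daemon.json; the helper name check_daemon_config is hypothetical and not part of Docker or OpenPAI.

```python
import json
from pathlib import Path


def check_daemon_config(path):
    """Return (ok, detail) for a Docker daemon config file.

    Hypothetical helper: it only checks that the file parses as a JSON
    object. It does not validate individual Docker options.
    """
    config_path = Path(path)
    if not config_path.exists():
        # Without a daemon.json, Docker starts with its built-in defaults.
        return True, "no file found; Docker will use its defaults"
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(config, dict):
        return False, "top-level value must be a JSON object"
    return True, "parsed OK; keys: " + ", ".join(sorted(config))


if __name__ == "__main__":
    # Assumed standard path; adjust if the config lives elsewhere.
    ok, detail = check_daemon_config("/etc/docker/daemon.json")
    print("OK" if ok else "PROBLEM", "-", detail)
```

If the file parses cleanly and Docker still fails to restart, the Docker service logs on that node are the next place to look.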

I reinstalled the OS on this node and added it back; it is OK for now.