kubernetes-retired / kube-batch

A batch scheduler of kubernetes for high performance workload, e.g. AI/ML, BigData, HPC

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

pod will stuck in pending status when host down

davidstack opened this issue · comments

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:
pod stuck in pending status after bind to the host which is powered off
What you expected to happen:
pod will rescheduled when host is down
How to reproduce it (as minimally and precisely as possible):
because of the cluster snapshot, host status will not update immediately,so pod may be binded to a host which is powered off in this schedule period. but kube-batch will update this pod in Binding status. but actually, the pod is in pending status

Anything else we need to know?:
i think kube-batch should one goroutine to check this status ,and change the status of pod from binding to pennding in kube-batch cache.
Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Will GC delete the pod to trigger re-scheduling?

the node has been poweroff,kubelet can not gc ?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.