fix: lingering GPU pods on cluster restart
CollectiveUnicorn opened this issue
Bug
GPU pods linger with the status UnexpectedAdmissionError
after a cluster restart and trigger replacement replicas to be created.
Expected
No pods with the status UnexpectedAdmissionError
linger in the cluster after a restart.
Context
When the cluster is restarted, existing GPU pods fail with an UnexpectedAdmissionError
because they cannot be allocated GPUs. This is believed to be a race condition with the nvidia-device-plugin
daemonset, which must be running before GPUs can be allocated: the pods are admitted before the plugin has registered. Replacement replicas are then spun up successfully, but the failed pods remain until they are deleted manually. They do not affect functionality, but they do cause confusion.
Reproduce
Restart a cluster that has running GPU pods.