fix: lingering GPU pods on cluster restart
CollectiveUnicorn opened this issue
Bug
GPU pods linger with the status UnexpectedAdmissionError
after a cluster restart and trigger replacement replicas to be created.
Expected
No pods with the status UnexpectedAdmissionError
linger in the cluster after a restart.
Context
When the cluster is restarted, existing GPU pods fail with an UnexpectedAdmissionError
because they cannot be allocated GPUs. This is believed to be a race condition with the nvidia-device-plugin
daemonset, which must be running before GPUs can be allocated: the pods are admitted before the plugin has registered. Replacement replicas are then spun up successfully, but the failed pods remain until they are deleted manually. They do not affect functionality, but they do cause confusion.
Reproduce
Restart a cluster that has running GPU pods.