defenseunicorns / leapfrogai

Production-ready Generative AI for local, cloud native, airgap, and edge deployments.

Home Page: https://leapfrog.ai

fix: lingering GPU pods on cluster restart

CollectiveUnicorn opened this issue

Bug
On cluster restart, existing GPU pods are left lingering with the status UnexpectedAdmissionError, and replacement replicas are created for them.

Expected
No pods with the status UnexpectedAdmissionError linger in the cluster after a restart.

Context
When the cluster is restarted, the existing GPU pods appear to fail with an UnexpectedAdmissionError because no GPUs can be allocated to them. This is believed to be caused by a race condition with the nvidia-device-plugin daemonset, which is required to allocate GPUs. New replicas are then spun up successfully, but the old pods remain until they are deleted manually. The lingering pods do not affect functionality, but they do cause confusion.
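Until the root cause is addressed, the manual deletion described above could be scripted. The following is only a minimal sketch, not a fix from the project. It assumes that the rejected pods report phase "Failed" and reason "UnexpectedAdmissionError" in their status, and that kubectl is on the PATH and pointed at the affected cluster. The function names find_stale_pods and delete_stale_pods are hypothetical.

```python
import json
import subprocess

# Assumption: pods rejected at admission carry this reason in .status.reason
# together with .status.phase == "Failed".
STALE_REASON = "UnexpectedAdmissionError"


def find_stale_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (namespace, name) pairs for pods rejected at admission.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    stale = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == STALE_REASON:
            meta = pod["metadata"]
            stale.append((meta.get("namespace", "default"), meta["name"]))
    return stale


def delete_stale_pods(dry_run: bool = True) -> list[tuple[str, str]]:
    """List pods in all namespaces and delete the stale ones.

    With dry_run=True (the default) the pods are only printed, not deleted.
    """
    raw = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    stale = find_stale_pods(json.loads(raw))
    for namespace, name in stale:
        if dry_run:
            print(f"would delete {namespace}/{name}")
        else:
            subprocess.run(
                ["kubectl", "delete", "pod", name, "-n", namespace],
                check=True,
            )
    return stale
```

Calling delete_stale_pods() first with the default dry_run=True shows which pods would be removed. Passing dry_run=False performs the deletion. This only cleans up after the fact and does not address the suspected race with the nvidia-device-plugin daemonset.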

Reproduce
Restart a cluster with running GPU pods.
