Cache is not up to date and causes `UnexpectedAdmissionError`
zionwu opened this issue
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
In my company's production environment, we are using kube-batch v0.3.0 with tf-operator.
I found that sometimes a TF-Job pod ends up with status `UnexpectedAdmissionError`.
The output of `kubectl describe pod` is:
```
# kubectl describe pod imgsr-11103465-20200312165729394-worker-0 -n 11103465
Name: imgsr-11103465-20200312165729394-worker-0
Namespace: 11103465
Priority: 0
PriorityClassName: <none>
Node: 10.193.86.6
Start Time: Thu, 12 Mar 2020 16:58:22 +0800
Labels: group_name=kubeflow.org
tf-replica-index=0
tf-replica-type=worker
tf_job_name=imgsr-11103465-20200312165729394
Annotations: <none>
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
```
I found this error occurs because the node has no idle GPU resources, yet kube-batch still schedules the pod to the node. When kubelet starts the pod, it fails the admission check.
The output of `kubectl describe node` shows that all 4 GPUs of the node are already allocated:
```
# kubectl describe node 10.193.86.6
Name: 10.193.86.6
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=10.193.86.6
kubernetes.io/os=linux
node-role.kubernetes.io/worker=true
Annotations: node.alpha.kubernetes.io/ttl: 15
CreationTimestamp: Tue, 19 Nov 2019 17:56:49 +0800
Taints: nvidia.com/gpu=v100:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 19 Nov 2019 17:57:12 +0800 Tue, 19 Nov 2019 17:57:12 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Fri, 13 Mar 2020 15:31:24 +0800 Tue, 19 Nov 2019 17:56:49 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 13 Mar 2020 15:31:24 +0800 Tue, 19 Nov 2019 17:56:49 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 13 Mar 2020 15:31:24 +0800 Tue, 19 Nov 2019 17:56:49 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 13 Mar 2020 15:31:24 +0800 Tue, 19 Nov 2019 17:57:19 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.193.86.6
Hostname: 10.193.86.6
Capacity:
cpu: 64
ephemeral-storage: 217021920Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 526997368Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 64
ephemeral-storage: 200007401141
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 526894968Ki
nvidia.com/gpu: 4
pods: 110
System Info:
Machine ID: 311936e2303b034fe7ef70182235b8cb
System UUID: F8BF926C-F0C4-03E4-B211-D21D202DF91A
Boot ID: 3a719cd5-e3e5-4113-8642-fd423110733f
Kernel Version: 4.9.99
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.7
Kubelet Version: v1.14.3
Kube-Proxy Version: v1.14.3
PodCIDR: 10.227.223.0/24
Non-terminated Pods: (10 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
11069822 v-dev-11069822-5db6df5868-nvmrh 4 (6%) 4 (6%) 9223372036854775807 (1709486672%) 9223372036854775807 (1709486672%) 20h
11103451 v-dev-11103451-8f85b9dfb-9swz8 32 (50%) 32 (50%) 200Gi (39%) 200Gi (39%) 3d22h
11103931 v-dev-11103931-5c8bdf8c5d-r68d6 10 (15%) 10 (15%) 20Gi (3%) 20Gi (3%) 3d22h
kube-system calico-node-j244l 250m (0%) 0 (0%) 0 (0%) 0 (0%) 114d
kube-system fluentd-es-jghrk 100m (0%) 0 (0%) 200Mi (0%) 500Mi (0%) 67d
kube-system nvidia-device-plugin-daemonset-hnczj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 10d
monitoring cadvisor-46755 200m (0%) 2 (3%) 200Mi (0%) 2000Mi (0%) 82d
monitoring dcgm-exporter-c629r 100m (0%) 200m (0%) 30Mi (0%) 50Mi (0%) 114d
monitoring monitor-gfsclient-r8f48 100m (0%) 300m (0%) 200Mi (0%) 500Mi (0%) 11d
monitoring node-exporter-jgxzg 200m (0%) 1 (1%) 50Mi (0%) 1Gi (0%) 24d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 46950m (73%) 49500m (77%)
memory 9223372273791008767 (-1709486628%) 9223372277349875711 (-1709486627%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 4 4
Events: <none>
```
I printed kube-batch's cache and found that the pod v-dev-11069822-5db6df5868-nvmrh, which uses 1 GPU, is missing from the cache, so kube-batch assumes the node still has one idle GPU (see the short sketch after the dump below):
```
10.193.86.6: idle(cpu 21050.00, memory 302604214272.00, GPU 1000.00) used(cpu 42950.00, memory 236936232960.00, GPU 3000.00) allocatable(cpu 64000.00, memory 539540447232.00, GPU 4000.00) pods(9)
0: Task (4a5d09a5-61e5-11ea-bb04-6c92bfdaad6a:11103931/v-dev-11103931-5c8bdf8c5d-r68d6): job 4a5c526e-61e5-11ea-bb04-6c92bfdaad6a, status Running, pri 0, resreq cpu 10000.00, memory 21474836480.00, GPU 1000.00, node 10.193.86.6
1: Task (d99d1a9a-61e7-11ea-bb04-6c92bfdaad6a:11103451/v-dev-11103451-8f85b9dfb-9swz8): job d99c5a25-61e7-11ea-bb04-6c92bfdaad6a, status Running, pri 0, resreq cpu 32000.00, memory 214748364800.00, GPU 2000.00, node 10.193.86.6
2: Task (82c45cce-23d1-11ea-99fd-6c92bfdaad6a:monitoring/cadvisor-46755): job 7efad7af-23d1-11ea-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 200.00, memory 209715200.00, GPU 0.00, node 10.193.86.6
3: Task (eebd307e-5c51-11ea-bb04-6c92bfdaad6a:monitoring/monitor-gfsclient-r8f48): job e844e81c-5c51-11ea-b0bd-6c92bfdad6b6, status Running, pri 0, resreq cpu 100.00, memory 209715200.00, GPU 0.00, node 10.193.86.6
4: Task (98339006-3029-11ea-99fd-6c92bfdaad6a:kube-system/fluentd-es-jghrk): job 96429856-3029-11ea-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 100.00, memory 209715200.00, GPU 0.00, node 10.193.86.6
5: Task (f99074a8-0ab2-11ea-99fd-6c92bfdaad6a:monitoring/dcgm-exporter-c629r): job bed65e78-d3c6-11e9-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 100.00, memory 31457280.00, GPU 0.00, node 10.193.86.6
6: Task (28d8238e-5c6d-11ea-bb04-6c92bfdaad6a:kube-system/nvidia-device-plugin-daemonset-hnczj): job f8328f1b-df5c-11e9-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 0.00, memory 0.00, GPU 0.00, node 10.193.86.6
7: Task (e7ab98b1-0ab2-11ea-99fd-6c92bfdaad6a:kube-system/calico-node-j244l): job 6d43b891-d3c5-11e9-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 250.00, memory 0.00, GPU 0.00, node 10.193.86.6
8: Task (0fff0258-5162-11ea-99fd-6c92bfdaad6a:monitoring/node-exporter-jgxzg): job 0c81a483-5162-11ea-8bd6-6c92bfdad6b6, status Running, pri 0, resreq cpu 200.00, memory 52428800.00, GPU 0.00, node 10.193.86.6
```
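To make the arithmetic concrete, here is a small illustration (not kube-batch's actual code) of how a missing pod inflates the idle count: the cache computes idle as allocatable minus the requests of the tasks it knows about, so the unaccounted 1-GPU pod leaves 1000 milli-GPU of phantom headroom:

```go
package main

import "fmt"

func main() {
	// Values as printed by the cache dump above (milli-units).
	allocatableGPU := 4000.0
	// Only the two GPU tasks that made it into the cache; the 1-GPU pod
	// v-dev-11069822-5db6df5868-nvmrh is missing from this list.
	knownGPURequests := []float64{1000, 2000}

	used := 0.0
	for _, req := range knownGPURequests {
		used += req
	}
	idle := allocatableGPU - used

	// Prints "used=3000 idle=1000", while the node is actually fully allocated.
	fmt.Printf("used=%v idle=%v\n", used, idle)
}
```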
@k82cn Could you please help? Why is the cache not up to date?
What you expected to happen:
The node's resources in the cache should be correct so that scheduling does not cause `UnexpectedAdmissionError`.
In cache.go, the defaultResync is 0 when creating the informerFactory:

```go
informerFactory := informers.NewSharedInformerFactory(sc.kubeclient, 0)
```
According to the doc https://godoc.org/k8s.io/client-go/tools/cache#NewSharedIndexInformer:
> The created informer will not do resyncs if the given defaultEventHandlerResyncPeriod is zero.
Is this the reason why the cache is not up to date? Should we set defaultResync to a non-zero value?
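For reference, a minimal sketch of what a non-zero resync period could look like; the 30-second value is an arbitrary choice for illustration, not something kube-batch ships with. With a non-zero period, client-go periodically re-delivers the objects in the informer's store to the registered event handlers, giving the scheduler cache another chance to apply anything it missed:

```go
import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newInformerFactory shows the one-line change: a non-zero defaultResync
// instead of 0, so the shared informers resync on an interval.
func newInformerFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactory(client, 30*time.Second)
}
```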
> Is this the reason why the cache is not up to date? Should we set defaultResync to a non-zero value?
I'm ok to do that; but I'm not sure why it's not updated :(
@k82cn
Finally, I found out the root cause of this issue:

- A GPU node had a GPU task running with the `OnFailure` RestartPolicy.
- The node accidentally rebooted for some reason.
- After the node came back up, the task restarted. The cache received the pod add event and `SchedulerCache.AddPod` was called, but it failed:

  ```
  E0408 13:04:33.159617 1 event_handlers.go:167] Failed to add pod <11090917/wzy-gpu-runsh-20200408154720486-worker-0> into cache: Selected node NotReady
  ```

- This is because when adding the task to the node, the cache checks `ti.Resreq.LessEqual(ni.Idle)`. The nvidia-device-plugin had only just come up and the node's allocatable GPU was still 0, so the check failed and the above error was returned.
- The cache then received the pod update event and `SchedulerCache.UpdatePod` was called, but it also failed:

  ```
  E0408 13:04:46.991721 1 event_handlers.go:192] Failed to update pod wzy-gpu-runsh-20200408154720486-worker-0 in cache: errors: 1: failed to find task <11090917/wzy-gpu-runsh-20200408154720486-worker-0> in job <11090917/wzy-gpu-runsh-20200408154720486>, 2: failed to find task <11090917/wzy-gpu-runsh-20200408154720486-worker-0> on host <10.196.2.7>
  ```

- This is because `UpdatePod` first calls `deletePod`, then `addPod`. `deletePod` failed because the task was not found in the job or on the host, the error was returned, and `addPod` was never called.

The fix for this issue is to ignore the error from `deletePod` and continue to call `addPod` (see the sketch below). I will open a PR for it.
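A minimal sketch of the proposed change, assuming `UpdatePod` funnels into an `updatePod` helper that chains `deletePod` and `addPod` as described above; the helper and receiver names follow that description rather than the exact source, and the `klog` call is only illustrative:

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog"
)

// updatePod re-syncs a pod in the scheduler cache. Instead of aborting when the
// old task cannot be found (for example because the earlier AddPod failed while
// the node was NotReady), the deletePod error is only logged, so addPod still
// runs and the re-created pod makes it back into the cache.
func (sc *SchedulerCache) updatePod(oldPod, newPod *v1.Pod) error {
	if err := sc.deletePod(oldPod); err != nil {
		klog.Warningf("Failed to delete pod <%s/%s> from cache during update, continuing: %v",
			oldPod.Namespace, oldPod.Name, err)
	}
	return sc.addPod(newPod)
}
```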
Issues go stale after 90d of inactivity.
Mark the issue as fresh with `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with `/remove-lifecycle rotten`.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with `/reopen`.
Mark the issue as fresh with `/remove-lifecycle rotten`.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
> Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.