kubernetes-retired / kube-batch

A batch scheduler of kubernetes for high performance workload, e.g. AI/ML, BigData, HPC

Cache is not up to date and causes `UnexpectedAdmissionError`

zionwu opened this issue

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
In my company's production environment, we are using kube-batch v0.3.0 with tf-operator.
I found that sometimes the status of a TF-Job pod is UnexpectedAdmissionError.
The result of kubectl describe pod is:

# kubectl describe pod  imgsr-11103465-20200312165729394-worker-0 -n 11103465 
Name:               imgsr-11103465-20200312165729394-worker-0
Namespace:          11103465
Priority:           0
PriorityClassName:  <none>
Node:              10.193.86.6
Start Time:         Thu, 12 Mar 2020 16:58:22 +0800
Labels:             group_name=kubeflow.org
                    tf-replica-index=0
                    tf-replica-type=worker
                    tf_job_name=imgsr-11103465-20200312165729394
Annotations:        <none>
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected

I found this error occurs because the node has no idle GPU resources, yet kube-batch still schedules the pod to the node. When kubelet starts the pod, it fails the admission check.

The result of kubectl describe node shows that all 4 GPUs of the node are already allocated.

# kubectl  describe node 10.193.86.6
Name:               10.193.86.6
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=10.193.86.6
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=true
Annotations:        node.alpha.kubernetes.io/ttl: 15
CreationTimestamp:  Tue, 19 Nov 2019 17:56:49 +0800
Taints:             nvidia.com/gpu=v100:NoSchedule
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 19 Nov 2019 17:57:12 +0800   Tue, 19 Nov 2019 17:57:12 +0800   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Fri, 13 Mar 2020 15:31:24 +0800   Tue, 19 Nov 2019 17:56:49 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Fri, 13 Mar 2020 15:31:24 +0800   Tue, 19 Nov 2019 17:56:49 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Fri, 13 Mar 2020 15:31:24 +0800   Tue, 19 Nov 2019 17:56:49 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Fri, 13 Mar 2020 15:31:24 +0800   Tue, 19 Nov 2019 17:57:19 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.193.86.6
  Hostname:    10.193.86.6
Capacity:
 cpu:                64
 ephemeral-storage:  217021920Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             526997368Ki
 nvidia.com/gpu:     4
 pods:               110
Allocatable:
 cpu:                64
 ephemeral-storage:  200007401141
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             526894968Ki
 nvidia.com/gpu:     4
 pods:               110
System Info:
 Machine ID:                 311936e2303b034fe7ef70182235b8cb
 System UUID:                F8BF926C-F0C4-03E4-B211-D21D202DF91A
 Boot ID:                    3a719cd5-e3e5-4113-8642-fd423110733f
 Kernel Version:             4.9.99
 OS Image:                   CentOS Linux 7 (Core)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.9.7
 Kubelet Version:            v1.14.3
 Kube-Proxy Version:         v1.14.3
PodCIDR:                     10.227.223.0/24
Non-terminated Pods:         (10 in total)
  Namespace                  Name                                    CPU Requests  CPU Limits  Memory Requests                    Memory Limits                      AGE
  ---------                  ----                                    ------------  ----------  ---------------                    -------------                      ---
  11069822                   v-dev-11069822-5db6df5868-nvmrh         4 (6%)        4 (6%)      9223372036854775807 (1709486672%)  9223372036854775807 (1709486672%)  20h
  11103451                   v-dev-11103451-8f85b9dfb-9swz8          32 (50%)      32 (50%)    200Gi (39%)                        200Gi (39%)                        3d22h
  11103931                   v-dev-11103931-5c8bdf8c5d-r68d6         10 (15%)      10 (15%)    20Gi (3%)                          20Gi (3%)                          3d22h
  kube-system                calico-node-j244l                       250m (0%)     0 (0%)      0 (0%)                             0 (0%)                             114d
  kube-system                fluentd-es-jghrk                        100m (0%)     0 (0%)      200Mi (0%)                         500Mi (0%)                         67d
  kube-system                nvidia-device-plugin-daemonset-hnczj    0 (0%)        0 (0%)      0 (0%)                             0 (0%)                             10d
  monitoring                 cadvisor-46755                          200m (0%)     2 (3%)      200Mi (0%)                         2000Mi (0%)                        82d
  monitoring                 dcgm-exporter-c629r                     100m (0%)     200m (0%)   30Mi (0%)                          50Mi (0%)                          114d
  monitoring                 monitor-gfsclient-r8f48                 100m (0%)     300m (0%)   200Mi (0%)                         500Mi (0%)                         11d
  monitoring                 node-exporter-jgxzg                     200m (0%)     1 (1%)      50Mi (0%)                          1Gi (0%)                           24d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests                            Limits
  --------           --------                            ------
  cpu                46950m (73%)                        49500m (77%)
  memory             9223372273791008767 (-1709486628%)  9223372277349875711 (-1709486627%)
  ephemeral-storage  0 (0%)                              0 (0%)
  nvidia.com/gpu     4                                   4
Events:              <none>

I printed kube-batch's cache and found that the pod v-dev-11069822-5db6df5868-nvmrh, which uses 1 GPU, is missing from the cache, so kube-batch assumes the node still has one idle GPU:

        10.193.86.6: idle(cpu 21050.00, memory 302604214272.00, GPU 1000.00) used(cpu 42950.00, memory 236936232960.00, GPU 3000.00) allocatable(cpu 64000.00, memory 539540447232.00, GPU 4000.00) pods(9)
                 0: Task (4a5d09a5-61e5-11ea-bb04-6c92bfdaad6a:11103931/v-dev-11103931-5c8bdf8c5d-r68d6): job 4a5c526e-61e5-11ea-bb04-6c92bfdaad6a, status Running, pri 0, resreq cpu 10000.00, memory 21474836480.00, GPU 1000.00, node 10.193.86.6
                 1: Task (d99d1a9a-61e7-11ea-bb04-6c92bfdaad6a:11103451/v-dev-11103451-8f85b9dfb-9swz8): job d99c5a25-61e7-11ea-bb04-6c92bfdaad6a, status Running, pri 0, resreq cpu 32000.00, memory 214748364800.00, GPU 2000.00, node 10.193.86.6
                 2: Task (82c45cce-23d1-11ea-99fd-6c92bfdaad6a:monitoring/cadvisor-46755): job 7efad7af-23d1-11ea-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 200.00, memory 209715200.00, GPU 0.00, node 10.193.86.6
                 3: Task (eebd307e-5c51-11ea-bb04-6c92bfdaad6a:monitoring/monitor-gfsclient-r8f48): job e844e81c-5c51-11ea-b0bd-6c92bfdad6b6, status Running, pri 0, resreq cpu 100.00, memory 209715200.00, GPU 0.00, node 10.193.86.6
                 4: Task (98339006-3029-11ea-99fd-6c92bfdaad6a:kube-system/fluentd-es-jghrk): job 96429856-3029-11ea-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 100.00, memory 209715200.00, GPU 0.00, node 10.193.86.6
                 5: Task (f99074a8-0ab2-11ea-99fd-6c92bfdaad6a:monitoring/dcgm-exporter-c629r): job bed65e78-d3c6-11e9-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 100.00, memory 31457280.00, GPU 0.00, node 10.193.86.6
                 6: Task (28d8238e-5c6d-11ea-bb04-6c92bfdaad6a:kube-system/nvidia-device-plugin-daemonset-hnczj): job f8328f1b-df5c-11e9-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 0.00, memory 0.00, GPU 0.00, node 10.193.86.6
                 7: Task (e7ab98b1-0ab2-11ea-99fd-6c92bfdaad6a:kube-system/calico-node-j244l): job 6d43b891-d3c5-11e9-b40d-6c92bfd6b6ac, status Running, pri 0, resreq cpu 250.00, memory 0.00, GPU 0.00, node 10.193.86.6
                 8: Task (0fff0258-5162-11ea-99fd-6c92bfdaad6a:monitoring/node-exporter-jgxzg): job 0c81a483-5162-11ea-8bd6-6c92bfdad6b6, status Running, pri 0, resreq cpu 200.00, memory 52428800.00, GPU 0.00, node 10.193.86.6

@k82cn Could you please help? Why is the cache not up to date?

What you expected to happen:
The resources of the node in the cache are correct, so they do not cause UnexpectedAdmissionError.

In cache.go, defaultResync is 0 when the informerFactory is created:

informerFactory := informers.NewSharedInformerFactory(sc.kubeclient, 0)

According to the doc https://godoc.org/k8s.io/client-go/tools/cache#NewSharedIndexInformer:

The created informer will not do resyncs if the given defaultEventHandlerResyncPeriod is zero.

Is this the reason why the cache is not up to date? Should we set defaultResync to non-zero?
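
For illustration, a minimal Go sketch of that change (the 30-minute period is just an example value, not a recommendation): with a non-zero resync period, the informer periodically re-delivers every object in its local store to the registered event handlers as Update events, which would give kube-batch's handlers another chance to reconcile the scheduler cache.

package cache

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newInformerFactory sketches the proposed change: pass a non-zero resync
// period instead of 0 so that event handlers periodically see every cached
// object again as an Update event.
func newInformerFactory(kubeclient kubernetes.Interface) informers.SharedInformerFactory {
	// current code: informers.NewSharedInformerFactory(sc.kubeclient, 0)
	return informers.NewSharedInformerFactory(kubeclient, 30*time.Minute)
}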

Is this the reason why the cache is not up to date? Should we set defaultResync to non-zero?

I'm ok to do that; but I'm not sure why it's not updated :(

@k82cn
Finally, I found the root cause of this issue:

  • A GPU Node had a GPU task running with OnFailure RestartPolicy.
  • The node rebooted unexpectedly.
  • After the node came back up, the task restarted; the cache received a pod add event and SchedulerCache.AddPod was called, but it failed:

E0408 13:04:33.159617 1 event_handlers.go:167] Failed to add pod <11090917/wzy-gpu-runsh-20200408154720486-worker-0> into cache: Selected node NotReady

  • This is because when adding the task to the node, kube-batch checks ti.Resreq.LessEqual(ni.Idle). The nvidia-device-plugin has only just come up and the node's allocatable GPU is still 0, so the check fails and the above error is returned (a simplified sketch of this check follows the list below).

  • The cache then received a pod update event and SchedulerCache.UpdatePod was called, but it also failed:

E0408 13:04:46.991721 1 event_handlers.go:192] Failed to update pod wzy-gpu-runsh-20200408154720486-worker-0 in cache: errors: 1: failed to find task <11090917/wzy-gpu-runsh-20200408154720486-worker-0> in job <11090917/wzy-gpu-runsh-20200408154720486>, 2: failed to find task <11090917/wzy-gpu-runsh-20200408154720486-worker-0> on host <10.196.2.7>

  • This is because UpdatePod first calls deletePod and then addPod. deletePod fails because the task is not found in the job or on the host, the error is returned, and addPod is never called.
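
For context, the following is a simplified Go sketch of that idle-resource check, not the actual kube-batch source; the Resource type and addTaskToNode function here are stand-ins for kube-batch's TaskInfo/NodeInfo logic. Right after the reboot the device plugin has not yet re-advertised its GPUs, so the cached node's idle GPU is 0 and the comparison fails.

package sketch

import "fmt"

// Resource is a stand-in for kube-batch's resource type; only the fields
// needed to illustrate the check are included.
type Resource struct {
	MilliCPU float64
	Memory   float64
	GPU      float64
}

// LessEqual mirrors the ti.Resreq.LessEqual(ni.Idle) comparison: the task's
// request must fit within the node's idle resources in every dimension.
func (r *Resource) LessEqual(other *Resource) bool {
	return r.MilliCPU <= other.MilliCPU &&
		r.Memory <= other.Memory &&
		r.GPU <= other.GPU
}

// addTaskToNode fails when the request does not fit, which is what happened
// here: the pod asks for 1 GPU while the node's idle GPU in the cache is
// still 0 because the device plugin has not re-registered after the reboot.
func addTaskToNode(taskReq, nodeIdle *Resource) error {
	if !taskReq.LessEqual(nodeIdle) {
		return fmt.Errorf("Selected node NotReady")
	}
	nodeIdle.MilliCPU -= taskReq.MilliCPU
	nodeIdle.Memory -= taskReq.Memory
	nodeIdle.GPU -= taskReq.GPU
	return nil
}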

The fix for this issue is to ignore the error from deletePod and continue to call addPod. I will open a PR for it.
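
A minimal sketch of what such a fix could look like, with the deletePod/addPod methods stubbed out (the names follow the log messages above, but the code below is illustrative, not the actual patch): the only point is the control flow of tolerating the delete failure so that addPod still runs.

package sketch

import (
	"log"

	v1 "k8s.io/api/core/v1"
)

// SchedulerCache is a stand-in for kube-batch's scheduler cache; only what is
// needed to show the control flow of the fix is included.
type SchedulerCache struct{}

func (sc *SchedulerCache) deletePod(pod *v1.Pod) error { return nil } // stub
func (sc *SchedulerCache) addPod(pod *v1.Pod) error    { return nil } // stub

// updatePod tolerates a failed deletePod (for example "failed to find task")
// and still calls addPod, so a pod whose earlier add was rejected can be
// re-inserted into the cache once the node's resources are visible again.
func (sc *SchedulerCache) updatePod(oldPod, newPod *v1.Pod) error {
	if err := sc.deletePod(oldPod); err != nil {
		log.Printf("ignoring deletePod error for %s/%s: %v",
			oldPod.Namespace, oldPod.Name, err)
	}
	return sc.addPod(newPod)
}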

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.