tkestack / gpu-admission

update pod annotation failed

fighterhit opened this issue

When I tested gpu-admission on Kubernetes v1.13.5, I got the following error:

I0814 08:36:19.986356       1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:ai-1080ti-15 tencent.com/predicate-time:1597394179983794058] to pod 9a3a7c36-dd45-11ea-8e57-6c92bf66acae due to pods "test33" not found
I0814 08:36:19.986380       1 util.go:71] Determine if the container test33 needs GPU resource
I0814 08:36:19.986394       1 share.go:58] Pick up 0 , cores: 100, memory: 43
I0814 08:36:19.988944       1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:ai-1080ti-57 tencent.com/predicate-time:1597394179986399567] to pod 9a3a7c36-dd45-11ea-8e57-6c92bf66acae due to pods "test33" not found
I0814 08:36:19.988971       1 util.go:71] Determine if the container test33 needs GPU resource
I0814 08:36:19.988986       1 share.go:58] Pick up 0 , cores: 100, memory: 43
I0814 08:36:19.991268       1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:ai-1080ti-62 tencent.com/predicate-time:1597394179988992239] to pod 9a3a7c36-dd45-11ea-8e57-6c92bf66acae due to pods "test33" not found
...
I0814 08:36:19.992368       1 routes.go:81] GPUQuotaPredicate: extenderFilterResult = {"Nodes":{"metadata":{},"items":[]},"NodeNames":null,"FailedNodes":{"ai-1080ti-15":"update pod annotation failed","ai-1080ti-57":"update pod annotation failed","ai-1080ti-62":"update pod annotation failed"},"Error":""}

  • The pod description (kubectl describe output)
Name:               test33
Namespace:          danlu-efficiency
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             <none>
Annotations:        <none>
Status:             Pending
IP:
NominatedNodeName:  ai-1080ti-62
Containers:
  test33:
    Image:      danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      sleep 100000000
    Limits:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  30
    Requests:
      tencent.com/vcuda-core:    10
      tencent.com/vcuda-memory:  30
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-p6lfp (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-p6lfp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p6lfp
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                     From           Message
  ----     ------            ----                    ----           -------
  Warning  FailedScheduling  4m56s (x1581 over 19h)  gpu-admission  0/16 nodes are available: 1 node(s) were unschedulable, 12 Insufficient tencent.com/vcuda-core, 12 Insufficient tencent.com/vcuda-memory, 3 update pod annotation failed.

Node information:

qa-jenkins@fuxi-qa-3:~/vgpu$ kubectl get nodes
NAME           STATUS                     ROLES             AGE    VERSION
ai-1080ti-15   Ready                      nvidia            463d   v1.13.3
ai-1080ti-57   Ready                      1080ti            463d   v1.13.3
ai-1080ti-62   Ready                      nvidia418         442d   v1.13.5
fuxi-dl-42     Ready                      <none>            302d   v1.13.5
fuxi-dl-46     Ready                      <none>            464d   v1.13.3
fuxi-dl-47     Ready                      <none>            464d   v1.13.3
fuxi-dl-48     Ready                      <none>            442d   v1.13.5
fuxi-qa-10g    Ready                      1080ti,training   414d   v1.13.5
fuxi-qa-12g    Ready                      nvidia            414d   v1.13.5
fuxi-qa-14     Ready                      <none>            353d   v1.13.5
fuxi-qa-15     Ready                      <none>            353d   v1.13.5
fuxi-qa-16     Ready                      <none>            309d   v1.13.5
fuxi-qa-3      Ready,SchedulingDisabled   master            603d   v1.13.5
fuxi-qa-4      Ready                      <none>            464d   v1.13.3
fuxi-qa-5      Ready                      <none>            464d   v1.13.3
fuxi-qa-8g     Ready                      nvidia            464d   v1.13.3

NOTE: The nodes beginning with 'ai' are GPU nodes and are labeled with 'nvidia-device-enable=enable'. Some information about the GPU nodes follows:
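
(For reference, the GPU nodes can be listed by that label directly; a quick check, assuming the label is applied exactly as shown:)

kubectl get nodes -l nvidia-device-enable=enable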

  • ai-1080ti-15
Name:               ai-1080ti-15
Roles:              nvidia
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    hardware=NVIDIAGPU
                    hardware-type=NVIDIAGPU
                    kubernetes.io/hostname=ai-1080ti-15
                    node-role.kubernetes.io/nvidia=GPU
                    nvidia-device-enable=enable
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.200.0.72/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 08 May 2019 14:28:18 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------    -----------------                 ------------------                ------                       -------
  MemoryPressure   False     Fri, 14 Aug 2020 17:54:08 +0800   Thu, 30 Apr 2020 18:26:46 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False     Fri, 14 Aug 2020 17:54:08 +0800   Thu, 30 Apr 2020 18:26:46 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False     Fri, 14 Aug 2020 17:54:08 +0800   Thu, 30 Apr 2020 18:26:46 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True      Fri, 14 Aug 2020 17:54:08 +0800   Thu, 30 Apr 2020 18:26:46 +0800   KubeletReady                 kubelet is posting ready status
  OutOfDisk        Unknown   Wed, 08 May 2019 14:28:18 +0800   Wed, 08 May 2019 14:33:54 +0800   NodeStatusNeverUpdated       Kubelet never posted node status.
Addresses:
  InternalIP:  10.200.0.72
  Hostname:    ai-1080ti-15
Capacity:
 cpu:                       56
 ephemeral-storage:         1153070996Ki
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    264030000Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
Allocatable:
 cpu:                       53
 ephemeral-storage:         1041195391675
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    251344688Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
System Info:
 Machine ID:                   2030b7c755d0458cbe03ef3b39b9412b
 System UUID:                  00000000-0000-0000-0000-ac1f6b27b26a
 Boot ID:                      c9dce882-9bc3-478d-a81d-1a8dcfd02a4f
 Kernel Version:               4.19.0-0.bpo.8-amd64
 OS Image:                     Debian GNU/Linux 9 (stretch)
 Operating System:             linux
 Architecture:                 amd64
 Container Runtime Version:    docker://18.6.2
 Kubelet Version:              v1.13.3
 Kube-Proxy Version:           v1.13.3
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests            Limits
  --------                  --------            ------
  cpu                       51320m (96%)        93300m (176%)
  memory                    101122659840 (39%)  240384047Ki (95%)
  ephemeral-storage         0 (0%)              0 (0%)
  nvidia.com/gpu            7                   7
  tencent.com/vcuda-core    0                   0
  tencent.com/vcuda-memory  0                   0

  • ai-1080ti-57
Name:               ai-1080ti-57
Roles:              1080ti
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    hardware=NVIDIAGPU
                    hardware-type=NVIDIAGPU
                    kubernetes.io/hostname=ai-1080ti-57
                    node-role.kubernetes.io/1080ti=1080ti
                    nvidia-device-enable=enable
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.90.1.126/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 08 May 2019 14:47:50 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------    -----------------                 ------------------                ------                       -------
  MemoryPressure   False     Fri, 14 Aug 2020 17:56:33 +0800   Wed, 12 Aug 2020 19:44:29 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False     Fri, 14 Aug 2020 17:56:33 +0800   Wed, 12 Aug 2020 19:44:29 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False     Fri, 14 Aug 2020 17:56:33 +0800   Wed, 12 Aug 2020 19:44:29 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True      Fri, 14 Aug 2020 17:56:33 +0800   Wed, 12 Aug 2020 19:44:29 +0800   KubeletReady                 kubelet is posting ready status
  OutOfDisk        Unknown   Wed, 08 May 2019 14:47:50 +0800   Fri, 09 Aug 2019 11:58:18 +0800   NodeStatusNeverUpdated       Kubelet never posted node status.
Addresses:
  InternalIP:  10.90.1.126
  Hostname:    ai-1080ti-57
Capacity:
 cpu:                       56
 ephemeral-storage:         1152148172Ki
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    264029980Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
Allocatable:
 cpu:                       53
 ephemeral-storage:         1040344917078
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    251344668Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
System Info:
 Machine ID:                   3ff54e221e0d475bacbe8a68bd0dd2e2
 System UUID:                  00000000-0000-0000-0000-ac1f6b91d6e8
 Boot ID:                      efc67001-4c66-4fca-946c-d13f0931fcc2
 Kernel Version:               4.19.0-0.bpo.8-amd64
 OS Image:                     Debian GNU/Linux 9 (stretch)
 Operating System:             linux
 Architecture:                 amd64
 Container Runtime Version:    docker://18.6.2
 Kubelet Version:              v1.13.3
 Kube-Proxy Version:           v1.13.3
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests            Limits
  --------                  --------            ------
  cpu                       47730m (90%)        73100m (137%)
  memory                    111630921472 (43%)  186532353536 (72%)
  ephemeral-storage         0 (0%)              0 (0%)
  nvidia.com/gpu            7                   7
  tencent.com/vcuda-core    0                   0
  tencent.com/vcuda-memory  0                   0
Events:                     <none>
  • ai-1080ti-62
Name:               ai-1080ti-62
Roles:              nvidia418
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    hardware=NVIDIAGPU
                    hardware-type=NVIDIAGPU
                    kubernetes.io/hostname=ai-1080ti-62
                    node-role.kubernetes.io/nvidia418=nvidia418
                    nvidia-device-enable=enable
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.90.1.131/24
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 29 May 2019 18:02:54 +0800
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 14 Aug 2020 17:57:17 +0800   Thu, 30 Jul 2020 16:42:25 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 14 Aug 2020 17:57:17 +0800   Thu, 30 Jul 2020 16:42:25 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 14 Aug 2020 17:57:17 +0800   Thu, 30 Jul 2020 16:42:25 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 14 Aug 2020 17:57:17 +0800   Thu, 30 Jul 2020 16:42:25 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.90.1.131
  Hostname:    ai-1080ti-62
Capacity:
 cpu:                       56
 ephemeral-storage:         1152148172Ki
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    264029984Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
Allocatable:
 cpu:                       53
 ephemeral-storage:         1040344917078
 hugepages-1Gi:             0
 hugepages-2Mi:             0
 memory:                    251344672Ki
 nvidia.com/gpu:            8
 pods:                      110
 tencent.com/vcuda-core:    800
 tencent.com/vcuda-memory:  349
System Info:
 Machine ID:                 bf90cb25500346cb8178be49909651e4
 System UUID:                00000000-0000-0000-0000-ac1f6b93483c
 Boot ID:                    97927469-0e92-4816-880c-243a64ef293a
 Kernel Version:             4.19.0-0.bpo.8-amd64
 OS Image:                   Debian GNU/Linux 9 (stretch)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.2
 Kubelet Version:            v1.13.5
 Kube-Proxy Version:         v1.13.5
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                  Requests           Limits
  --------                  --------           ------
  cpu                       51063m (96%)       83248m (157%)
  memory                    99256222976 (38%)  132428537Ki (52%)
  ephemeral-storage         0 (0%)             0 (0%)
  nvidia.com/gpu            7                  7
  tencent.com/vcuda-core    0                  0
  tencent.com/vcuda-memory  0                  0
Events:                     <none>

The go.mod:

module tkestack.io/gpu-admission

go 1.13

replace (
        //k8s.io/api => github.com/kubernetes/kubernetes/staging/src/k8s.io/api v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/api => k8s.io/api kubernetes-1.13.5

        k8s.io/apiextensions-apiserver => github.com/kubernetes/kubernetes/staging/src/k8s.io/apiextensions-apiserver v0.0.0-20190816231410-2d3c76f9091b
        
        //k8s.io/apimachinery => github.com/kubernetes/kubernetes/staging/src/k8s.io/apimachinery v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/apimachinery => k8s.io/apimachinery kubernetes-1.13.5
        
        k8s.io/apiserver => github.com/kubernetes/kubernetes/staging/src/k8s.io/apiserver v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/cli-runtime => github.com/kubernetes/kubernetes/staging/src/k8s.io/cli-runtime v0.0.0-20190816231410-2d3c76f9091b

        //k8s.io/client-go => github.com/kubernetes/kubernetes/staging/src/k8s.io/client-go v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/client-go => k8s.io/client-go kubernetes-1.13.5


        k8s.io/cloud-provider => github.com/kubernetes/kubernetes/staging/src/k8s.io/cloud-provider v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/cluster-bootstrap => github.com/kubernetes/kubernetes/staging/src/k8s.io/cluster-bootstrap v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/code-generator => github.com/kubernetes/kubernetes/staging/src/k8s.io/code-generator v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/component-base => github.com/kubernetes/kubernetes/staging/src/k8s.io/component-base v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/cri-api => github.com/kubernetes/kubernetes/staging/src/k8s.io/cri-api v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/csi-translation-lib => github.com/kubernetes/kubernetes/staging/src/k8s.io/csi-translation-lib v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/kube-aggregator => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-aggregator v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/kube-controller-manager => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-controller-manager v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/kube-proxy => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-proxy v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/kube-scheduler => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-scheduler v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/kubelet => github.com/kubernetes/kubernetes/staging/src/k8s.io/kubelet v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/legacy-cloud-providers => github.com/kubernetes/kubernetes/staging/src/k8s.io/legacy-cloud-providers v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/metrics => github.com/kubernetes/kubernetes/staging/src/k8s.io/metrics v0.0.0-20190816231410-2d3c76f9091b
        k8s.io/sample-apiserver => github.com/kubernetes/kubernetes/staging/src/k8s.io/sample-apiserver v0.0.0-20190816231410-2d3c76f9091b
)

require (
        github.com/gogo/protobuf v1.1.1 // indirect
        github.com/golang/protobuf v1.3.2 // indirect
        github.com/json-iterator/go v1.1.7 // indirect
        github.com/julienschmidt/httprouter v1.3.1-0.20191005171706-08a3b3d20bbe
        github.com/spf13/pflag v1.0.5
        golang.org/x/net v0.0.0-20191109021931-daa7c04131f5 // indirect
        golang.org/x/sys v0.0.0-20191010194322-b09406accb47 // indirect
        k8s.io/api v0.0.0
        k8s.io/apimachinery v0.0.0
        k8s.io/client-go v0.0.0
        k8s.io/component-base v0.0.0
        k8s.io/klog v0.3.1
        k8s.io/kubernetes v1.15.5
)

Did you change the go.mod and compile the binary yourself? The log says your pod was not found when appending the annotation, so you should check whether the pod was recreated with the same name.
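
(One way to check this, as a sketch: the log above records the UID of the pod it was trying to annotate, so compare it with the UID of the pod that currently exists; if they differ, a pod named test33 was deleted and recreated while the extender still held the old object.)

# UID of the live pod; compare with 9a3a7c36-dd45-11ea-8e57-6c92bf66acae from the gpu-admission log
kubectl get pod test33 -n danlu-efficiency -o jsonpath='{.metadata.uid}'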

Thanks for the reply. When I first hit this problem, I recompiled with a client-go version consistent with my Kubernetes version, as your colleague suggested. The problem went away once I used my colleague's gpu-admission binary, but I am still stuck on this issue today.
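
(For reference, pinning the k8s.io staging modules to the cluster's release is commonly done with go mod edit; a sketch, assuming the kubernetes-1.13.5 tags exist on those module repositories and that go mod tidy rewrites them to pseudo-versions:)

go mod edit \
  -replace k8s.io/api=k8s.io/api@kubernetes-1.13.5 \
  -replace k8s.io/apimachinery=k8s.io/apimachinery@kubernetes-1.13.5 \
  -replace k8s.io/client-go=k8s.io/client-go@kubernetes-1.13.5
go mod tidy     # resolve the tags and update go.sum
go build ./...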

I had the same problem. I solved it by modifying the system:kube-scheduler ClusterRole to add the patch permission on pods.

I0827 09:05:13.099172 1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:vm8035 tencent.com/predicate-time:1598519113097128139] to pod 7540b57a-722f-4f0f-a747-f368c2b40768 due to pods "gpu-sleep" is forbidden: User "system:kube-scheduler" cannot patch resource "pods" in API group "" in the namespace "default"

kubectl edit clusterroles system:kube-scheduler
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - delete
  - get
  - patch
  - list
  - watch
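
(To verify that the new permission is effective, impersonation can be used; a quick check, assuming the caller is allowed to impersonate system:kube-scheduler:)

kubectl auth can-i patch pods --as=system:kube-scheduler -n default
# expect "yes" once the ClusterRole includes the patch verb on pods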