update pod annotation failed
fighterhit opened this issue
When I tested gpu-admission on Kubernetes v1.13.5, I got the following error:
I0814 08:36:19.986356 1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:ai-1080ti-15 tencent.com/predicate-time:1597394179983794058] to pod 9a3a7c36-dd45-11ea-8e57-6c92bf66acae due to pods "test33" not found
I0814 08:36:19.986380 1 util.go:71] Determine if the container test33 needs GPU resource
I0814 08:36:19.986394 1 share.go:58] Pick up 0 , cores: 100, memory: 43
I0814 08:36:19.988944 1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:ai-1080ti-57 tencent.com/predicate-time:1597394179986399567] to pod 9a3a7c36-dd45-11ea-8e57-6c92bf66acae due to pods "test33" not found
I0814 08:36:19.988971 1 util.go:71] Determine if the container test33 needs GPU resource
I0814 08:36:19.988986 1 share.go:58] Pick up 0 , cores: 100, memory: 43
I0814 08:36:19.991268 1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:ai-1080ti-62 tencent.com/predicate-time:1597394179988992239] to pod 9a3a7c36-dd45-11ea-8e57-6c92bf66acae due to pods "test33" not found
...
I0814 08:36:19.992368 1 routes.go:81] GPUQuotaPredicate: extenderFilterResult = {"Nodes":{"metadata":{},"items":[]},"NodeNames":null,"FailedNodes":{"ai-1080ti-15":"update pod annotation failed","ai-1080ti-57":"update pod annotation failed","ai-1080ti-62":"update pod annotation failed"},"Error":""}
- The pod description (kubectl describe output):
Name: test33
Namespace: danlu-efficiency
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
NominatedNodeName: ai-1080ti-62
Containers:
test33:
Image: danlu/tensorflow:tf1.9.0_py2_gpu_v0.1
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
sleep 100000000
Limits:
tencent.com/vcuda-core: 10
tencent.com/vcuda-memory: 30
Requests:
tencent.com/vcuda-core: 10
tencent.com/vcuda-memory: 30
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-p6lfp (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-p6lfp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-p6lfp
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m56s (x1581 over 19h) gpu-admission 0/16 nodes are available: 1 node(s) were unschedulable, 12 Insufficient tencent.com/vcuda-core, 12 Insufficient tencent.com/vcuda-memory, 3 update pod annotation failed.
Node information:
- ai-1080ti-15
qa-jenkins@fuxi-qa-3:~/vgpu$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ai-1080ti-15 Ready nvidia 463d v1.13.3
ai-1080ti-57 Ready 1080ti 463d v1.13.3
ai-1080ti-62 Ready nvidia418 442d v1.13.5
fuxi-dl-42 Ready <none> 302d v1.13.5
fuxi-dl-46 Ready <none> 464d v1.13.3
fuxi-dl-47 Ready <none> 464d v1.13.3
fuxi-dl-48 Ready <none> 442d v1.13.5
fuxi-qa-10g Ready 1080ti,training 414d v1.13.5
fuxi-qa-12g Ready nvidia 414d v1.13.5
fuxi-qa-14 Ready <none> 353d v1.13.5
fuxi-qa-15 Ready <none> 353d v1.13.5
fuxi-qa-16 Ready <none> 309d v1.13.5
fuxi-qa-3 Ready,SchedulingDisabled master 603d v1.13.5
fuxi-qa-4 Ready <none> 464d v1.13.3
fuxi-qa-5 Ready <none> 464d v1.13.3
fuxi-qa-8g Ready nvidia 464d v1.13.3
NOTE: The nodes beginning with 'ai' are GPU nodes and are labeled with 'nvidia-device-enable=enable'. Some information about the GPU nodes follows:
Name: ai-1080ti-15
Roles: nvidia
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
hardware=NVIDIAGPU
hardware-type=NVIDIAGPU
kubernetes.io/hostname=ai-1080ti-15
node-role.kubernetes.io/nvidia=GPU
nvidia-device-enable=enable
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.200.0.72/24
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 08 May 2019 14:28:18 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 14 Aug 2020 17:54:08 +0800 Thu, 30 Apr 2020 18:26:46 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 14 Aug 2020 17:54:08 +0800 Thu, 30 Apr 2020 18:26:46 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 14 Aug 2020 17:54:08 +0800 Thu, 30 Apr 2020 18:26:46 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 14 Aug 2020 17:54:08 +0800 Thu, 30 Apr 2020 18:26:46 +0800 KubeletReady kubelet is posting ready status
OutOfDisk Unknown Wed, 08 May 2019 14:28:18 +0800 Wed, 08 May 2019 14:33:54 +0800 NodeStatusNeverUpdated Kubelet never posted node status.
Addresses:
InternalIP: 10.200.0.72
Hostname: ai-1080ti-15
Capacity:
cpu: 56
ephemeral-storage: 1153070996Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 264030000Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
Allocatable:
cpu: 53
ephemeral-storage: 1041195391675
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 251344688Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
System Info:
Machine ID: 2030b7c755d0458cbe03ef3b39b9412b
System UUID: 00000000-0000-0000-0000-ac1f6b27b26a
Boot ID: c9dce882-9bc3-478d-a81d-1a8dcfd02a4f
Kernel Version: 4.19.0-0.bpo.8-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.2
Kubelet Version: v1.13.3
Kube-Proxy Version: v1.13.3
...
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 51320m (96%) 93300m (176%)
memory 101122659840 (39%) 240384047Ki (95%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 7 7
tencent.com/vcuda-core 0 0
tencent.com/vcuda-memory 0 0
- ai-1080ti-57
Name: ai-1080ti-57
Roles: 1080ti
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
hardware=NVIDIAGPU
hardware-type=NVIDIAGPU
kubernetes.io/hostname=ai-1080ti-57
node-role.kubernetes.io/1080ti=1080ti
nvidia-device-enable=enable
Annotations: node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.90.1.126/24
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 08 May 2019 14:47:50 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 14 Aug 2020 17:56:33 +0800 Wed, 12 Aug 2020 19:44:29 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 14 Aug 2020 17:56:33 +0800 Wed, 12 Aug 2020 19:44:29 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 14 Aug 2020 17:56:33 +0800 Wed, 12 Aug 2020 19:44:29 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 14 Aug 2020 17:56:33 +0800 Wed, 12 Aug 2020 19:44:29 +0800 KubeletReady kubelet is posting ready status
OutOfDisk Unknown Wed, 08 May 2019 14:47:50 +0800 Fri, 09 Aug 2019 11:58:18 +0800 NodeStatusNeverUpdated Kubelet never posted node status.
Addresses:
InternalIP: 10.90.1.126
Hostname: ai-1080ti-57
Capacity:
cpu: 56
ephemeral-storage: 1152148172Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 264029980Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
Allocatable:
cpu: 53
ephemeral-storage: 1040344917078
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 251344668Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
System Info:
Machine ID: 3ff54e221e0d475bacbe8a68bd0dd2e2
System UUID: 00000000-0000-0000-0000-ac1f6b91d6e8
Boot ID: efc67001-4c66-4fca-946c-d13f0931fcc2
Kernel Version: 4.19.0-0.bpo.8-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.2
Kubelet Version: v1.13.3
Kube-Proxy Version: v1.13.3
...
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 47730m (90%) 73100m (137%)
memory 111630921472 (43%) 186532353536 (72%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 7 7
tencent.com/vcuda-core 0 0
tencent.com/vcuda-memory 0 0
Events: <none>
- ai-1080ti-62
Name: ai-1080ti-62
Roles: nvidia418
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
hardware=NVIDIAGPU
hardware-type=NVIDIAGPU
kubernetes.io/hostname=ai-1080ti-62
node-role.kubernetes.io/nvidia418=nvidia418
nvidia-device-enable=enable
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.90.1.131/24
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 29 May 2019 18:02:54 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 14 Aug 2020 17:57:17 +0800 Thu, 30 Jul 2020 16:42:25 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 14 Aug 2020 17:57:17 +0800 Thu, 30 Jul 2020 16:42:25 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 14 Aug 2020 17:57:17 +0800 Thu, 30 Jul 2020 16:42:25 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 14 Aug 2020 17:57:17 +0800 Thu, 30 Jul 2020 16:42:25 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.90.1.131
Hostname: ai-1080ti-62
Capacity:
cpu: 56
ephemeral-storage: 1152148172Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 264029984Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
Allocatable:
cpu: 53
ephemeral-storage: 1040344917078
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 251344672Ki
nvidia.com/gpu: 8
pods: 110
tencent.com/vcuda-core: 800
tencent.com/vcuda-memory: 349
System Info:
Machine ID: bf90cb25500346cb8178be49909651e4
System UUID: 00000000-0000-0000-0000-ac1f6b93483c
Boot ID: 97927469-0e92-4816-880c-243a64ef293a
Kernel Version: 4.19.0-0.bpo.8-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.2
Kubelet Version: v1.13.5
Kube-Proxy Version: v1.13.5
...
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 51063m (96%) 83248m (157%)
memory 99256222976 (38%) 132428537Ki (52%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 7 7
tencent.com/vcuda-core 0 0
tencent.com/vcuda-memory 0 0
Events: <none>
The go.mod:
module tkestack.io/gpu-admission
go 1.13
replace (
//k8s.io/api => github.com/kubernetes/kubernetes/staging/src/k8s.io/api v0.0.0-20190816231410-2d3c76f9091b
k8s.io/api => k8s.io/api kubernetes-1.13.5
k8s.io/apiextensions-apiserver => github.com/kubernetes/kubernetes/staging/src/k8s.io/apiextensions-apiserver v0.0.0-20190816231410-2d3c76f9091b
//k8s.io/apimachinery => github.com/kubernetes/kubernetes/staging/src/k8s.io/apimachinery v0.0.0-20190816231410-2d3c76f9091b
k8s.io/apimachinery => k8s.io/apimachinery kubernetes-1.13.5
k8s.io/apiserver => github.com/kubernetes/kubernetes/staging/src/k8s.io/apiserver v0.0.0-20190816231410-2d3c76f9091b
k8s.io/cli-runtime => github.com/kubernetes/kubernetes/staging/src/k8s.io/cli-runtime v0.0.0-20190816231410-2d3c76f9091b
//k8s.io/client-go => github.com/kubernetes/kubernetes/staging/src/k8s.io/client-go v0.0.0-20190816231410-2d3c76f9091b
k8s.io/client-go => k8s.io/client-go kubernetes-1.13.5
k8s.io/cloud-provider => github.com/kubernetes/kubernetes/staging/src/k8s.io/cloud-provider v0.0.0-20190816231410-2d3c76f9091b
k8s.io/cluster-bootstrap => github.com/kubernetes/kubernetes/staging/src/k8s.io/cluster-bootstrap v0.0.0-20190816231410-2d3c76f9091b
k8s.io/code-generator => github.com/kubernetes/kubernetes/staging/src/k8s.io/code-generator v0.0.0-20190816231410-2d3c76f9091b
k8s.io/component-base => github.com/kubernetes/kubernetes/staging/src/k8s.io/component-base v0.0.0-20190816231410-2d3c76f9091b
k8s.io/cri-api => github.com/kubernetes/kubernetes/staging/src/k8s.io/cri-api v0.0.0-20190816231410-2d3c76f9091b
k8s.io/csi-translation-lib => github.com/kubernetes/kubernetes/staging/src/k8s.io/csi-translation-lib v0.0.0-20190816231410-2d3c76f9091b
k8s.io/kube-aggregator => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-aggregator v0.0.0-20190816231410-2d3c76f9091b
k8s.io/kube-controller-manager => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-controller-manager v0.0.0-20190816231410-2d3c76f9091b
k8s.io/kube-proxy => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-proxy v0.0.0-20190816231410-2d3c76f9091b
k8s.io/kube-scheduler => github.com/kubernetes/kubernetes/staging/src/k8s.io/kube-scheduler v0.0.0-20190816231410-2d3c76f9091b
k8s.io/kubelet => github.com/kubernetes/kubernetes/staging/src/k8s.io/kubelet v0.0.0-20190816231410-2d3c76f9091b
k8s.io/legacy-cloud-providers => github.com/kubernetes/kubernetes/staging/src/k8s.io/legacy-cloud-providers v0.0.0-20190816231410-2d3c76f9091b
k8s.io/metrics => github.com/kubernetes/kubernetes/staging/src/k8s.io/metrics v0.0.0-20190816231410-2d3c76f9091b
k8s.io/sample-apiserver => github.com/kubernetes/kubernetes/staging/src/k8s.io/sample-apiserver v0.0.0-20190816231410-2d3c76f9091b
)
require (
github.com/gogo/protobuf v1.1.1 // indirect
github.com/golang/protobuf v1.3.2 // indirect
github.com/json-iterator/go v1.1.7 // indirect
github.com/julienschmidt/httprouter v1.3.1-0.20191005171706-08a3b3d20bbe
github.com/spf13/pflag v1.0.5
golang.org/x/net v0.0.0-20191109021931-daa7c04131f5 // indirect
golang.org/x/sys v0.0.0-20191010194322-b09406accb47 // indirect
k8s.io/api v0.0.0
k8s.io/apimachinery v0.0.0
k8s.io/client-go v0.0.0
k8s.io/component-base v0.0.0
k8s.io/klog v0.3.1
k8s.io/kubernetes v1.15.5
)
Did you change the go.mod and compile the binary yourself? The log says the pod is not found when appending the annotation, so you should check whether the pod was recreated with the same name.
Thanks for the reply. When I first hit this problem, I recompiled with a client-go version consistent with my Kubernetes version, as your colleague suggested.
Anyway, the problem went away once I used my colleague's gpu-admission binary, but I am still stuck on this issue today.
I had the same problem. I solved it by modifying the system:kube-scheduler ClusterRole to add the patch permission on pods.
I0827 09:05:13.099172 1 gpu_predicate.go:493] failed to add annotation map[tencent.com/gpu-assigned:false tencent.com/predicate-gpu-idx-0:0 tencent.com/predicate-node:vm8035 tencent.com/predicate-time:1598519113097128139] to pod 7540b57a-722f-4f0f-a747-f368c2b40768 due to pods "gpu-sleep" is forbidden: User "system:kube-scheduler" cannot patch resource "pods" in API group "" in the namespace "default"
kubectl edit clusterroles system:kube-scheduler
- apiGroups:
- ""
resources:
- pods
verbs:
- delete
- get
- patch
- list
- watch
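Note that on kubeadm-installed clusters the built-in system: roles are auto-reconciled by the API server (they carry the rbac.authorization.kubernetes.io/autoupdate: "true" annotation), so a direct edit can be reverted on restart or upgrade. A more durable alternative is a separate ClusterRole bound to the scheduler user. This is only a sketch; the name gpu-admission-pod-patch is made up for illustration:

```yaml
# Hypothetical object names; grants only the extra verb the scheduler needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpu-admission-pod-patch
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-pod-patch
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gpu-admission-pod-patch
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:kube-scheduler
```

After applying it with kubectl apply -f, you can verify the grant with: kubectl auth can-i patch pods --as=system:kube-scheduler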