carvel-dev / kapp

kapp is a simple deployment tool focused on the concept of "Kubernetes application" — a set of resources with the same label

Home Page: https://carvel.dev/kapp


allow to continue on a Conflict

universam1 opened this issue

We would like kapp to continue the rollout even on a Conflict, so that the remaining resources are rolled out on a best-effort basis. Related to #573; however, we need the ability not to fail instantly on such occasions.

Background

We are using kapp to deploy the whole cluster, including apps, in a kapp GitOps-style deployment. This adds up to 800 resources for a new cluster. (This ability of kapp to keep the whole cluster in a desired state is so powerful for us that we switched our tooling completely to kapp - amazing work done here, and we love kapp for it!)
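For context, the whole cluster is driven as a single kapp app over the rendered manifests, roughly like this (the app name and path below are placeholders, not our real values):

kapp deploy -a whole-cluster -f rendered-manifests/ --diff-changes --yes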

Since the rollout of so many resources takes time, there is a good chance of running into an error like the object has been modified; please apply your changes to the latest version and try again (reason: Conflict) for various resources. Some of these are mere conflicts with AWS EKS internal bootstrapping, where kapp may arrive too early; others are due to status updates happening in the meantime, between the two kapp stages.

  • Both cases are probably not fully avoidable.
  • For some resources (IAM RBAC), kapp failing hard is fatal for the whole cluster: it becomes impossible to access and recover. These deployments must go through by all means!

What happened:

Please see the following examples of kapp failing instantly.

Unavoidable race condition with AWS EKS internal bootstrapping
Error: update daemonset/aws-node (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource daemonset/aws-node (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on daemonsets.apps "aws-node": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations:
  4,  3 -     deprecated.daemonset.template.generation: "1"
  5,  3 -     kubectl.kubernetes.io/last-applied-configuration: |
  6,  3 -       {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"aws-vpc-cni","app.kubernetes.io/name":"aws-node","app.kubernetes.io/version":"v1.11.4","k8s-app":"aws-node"},"name":"aws-node","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"aws-node"}},"template":{"metadata":{"labels":{"app.kubernetes.io/instance":"aws-vpc-cni","app.kubernetes.io/name":"aws-node","k8s-app":"aws-node"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"kubernetes.io/os","operator":"In","values":["linux"]},{"key":"kubernetes.io/arch","operator":"In","values":["amd64","arm64"]},{"key":"eks.amazonaws.com/compute-type","operator":"NotIn","values":["fargate"]}]}]}}},"containers":[{"env":[{"name":"ADDITIONAL_ENI_TAGS","value":"{}"},{"name":"AWS_VPC_CNI_NODE_PORT_SUPPORT","value":"true"},{"name":"AWS_VPC_ENI_MTU","value":"9001"},{"name":"AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER","value":"false"},{"name":"AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG","value":"false"},{"name":"AWS_VPC_K8S_CNI_EXTERNALSNAT","value":"false"},{"name":"AWS_VPC_K8S_CNI_LOGLEVEL","value":"DEBUG"},{"name":"AWS_VPC_K8S_CNI_LOG_FILE","value":"/host/var/log/aws-routed-eni/ipamd.log"},{"name":"AWS_VPC_K8S_CNI_RANDOMIZESNAT","value":"prng"},{"name":"AWS_VPC_K8S_CNI_VETHPREFIX","value":"eni"},{"name":"AWS_VPC_K8S_PLUGIN_LOG_FILE","value":"/var/log/aws-routed-eni/plugin.log"},{"name":"AWS_VPC_K8S_PLUGIN_LOG_LEVEL","value":"DEBUG"},{"name":"DISABLE_INTROSPECTION","value":"false"},{"name":"DISABLE_METRICS","value":"false"},{"name":"DISABLE_NETWORK_RESOURCE_PROVISIONING","value":"false"},{"name":"ENABLE_IPv4","value":"true"},{"name":"ENABLE_IPv6","value":"false"},{"name":"ENABLE_POD_ENI","value":"false"},{"name":"ENABLE_PREFIX_DELEGATION","value":"false"},{"name":"WARM_ENI_TARGET","value":"1"},{"name":"WARM_PREFIX_TARGET","value":"1"},{"name":"MY_NODE_NAME","valueFrom":{"fieldRef":{"fieldPath":"spec.nodeName"}}}],"image":"602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni:v1.11.4-eksbuild.1","livenessProbe":{"exec":{"command":["/app/grpc-health-probe","-addr=:50051","-connect-timeout=5s","-rpc-timeout=5s"]},"initialDelaySeconds":60,"timeoutSeconds":10},"name":"aws-node","ports":[{"containerPort":61678,"name":"metrics"}],"readinessProbe":{"exec":{"command":["/app/grpc-health-probe","-addr=:50051","-connect-timeout=5s","-rpc-timeout=5s"]},"initialDelaySeconds":1,"timeoutSeconds":10},"resources":{"requests":{"cpu":"25m"}},"securityContext":{"capabilities":{"add":["NET_ADMIN"]}},"volumeMounts":[{"mountPath":"/host/opt/cni/bin","name":"cni-bin-dir"},{"mountPath":"/host/etc/cni/net.d","name":"cni-net-dir"},{"mountPath":"/host/var/log/aws-routed-eni","name":"log-dir"},{"mountPath":"/var/run/dockershim.sock","name":"dockershim"},{"mountPath":"/var/run/aws-node","name":"run-dir"},{"mountPath":"/run/xtables.lock","name":"xtables-lock"}]}],"hostNetwork":true,"initContainers":[{"env":[{"name":"DISABLE_TCP_EARLY_DEMUX","value":"false"},{"name":"ENABLE_IPv6","value":"false"}],"image":"602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni-init:v1.11.4-eksbuild.1","name":"aws-vpc-cni-init","securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/host/opt/cni/bin","name":"cni-bin-dir"}]}],"priorityClassName":"system-node-critical","securityContext":{},"serviceAccountName":"aws-node","terminationGracePeriodSeconds":10,"tolerations":[{"operator":"Exists"}],"volumes":[{"hostPath":{"path":"
/opt/cni/bin"},"name":"cni-bin-dir"},{"hostPath":{"path":"/etc/cni/net.d"},"name":"cni-net-dir"},{"hostPath":{"path":"/var/run/dockershim.sock"},"name":"dockershim"},{"hostPath":{"path":"/var/log/aws-routed-eni","type":"DirectoryOrCreate"},"name":"log-dir"},{"hostPath":{"path":"/var/run/aws-node","type":"DirectoryOrCreate"},"name":"run-dir"},{"hostPath":{"path":"/run/xtables.lock"},"name":"xtables-lock"}]}},"updateStrategy":{"rollingUpdate":{"maxUnavailable":"10%"},"type":"RollingUpdate"}}}
 12,  8 -     app.kubernetes.io/version: v1.11.4
 14,  9 +     kapp.k14s.io/app: "1682436697556578494"
 14, 10 +     kapp.k14s.io/association: v1.ca251169611f162ef5186bbf4f512ca0
319,316 -   revisionHistoryLimit: 10
325,321 -       creationTimestamp: null
326,321 +       annotations:
326,322 +         nodetaint/crucial: "true"
330,327 +         kapp.k14s.io/app: "1682436697556578494"
330,328 +         kapp.k14s.io/association: v1.ca251169611f162ef5186bbf4f512ca0
366,365 -           value: /host/var/log/aws-routed-eni/ipamd.log
367,365 +           value: stdout
372,371 -           value: /var/log/aws-routed-eni/plugin.log
373,371 +           value: stderr
396,395 -               apiVersion: v1
398,396 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni:v1.11.4-eksbuild.1
399,396 -         imagePullPolicy: IfNotPresent
400,396 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.12.5
407,404 -           failureThreshold: 3
409,405 -           periodSeconds: 10
410,405 -           successThreshold: 1
415,409 -           hostPort: 61678
417,410 -           protocol: TCP
425,417 -           failureThreshold: 3
427,418 -           periodSeconds: 10
428,418 -           successThreshold: 1
432,421 -             cpu: 25m
433,421 +             cpu: 50m
433,422 +             memory: 80Mi
437,427 -         terminationMessagePath: /dev/termination-log
438,427 -         terminationMessagePolicy: File
439,427 +             - NET_RAW
446,435 -         - mountPath: /var/run/dockershim.sock
447,435 -           name: dockershim
452,439 -       dnsPolicy: ClusterFirst
460,446 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon-k8s-cni-init:v1.11.4-eksbuild.1
461,446 -         imagePullPolicy: IfNotPresent
462,446 +         image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.12.5
463,448 -         resources: {}
466,450 -         terminationMessagePath: /dev/termination-log
467,450 -         terminationMessagePolicy: File
472,454 -       restartPolicy: Always
473,454 -       schedulerName: default-scheduler
475,455 -       serviceAccount: aws-node
483,462 -           type: ""
487,465 -           type: ""
489,466 -       - hostPath:
490,466 -           path: /var/run/dockershim.sock
491,466 -           type: ""
492,466 -         name: dockershim
503,476 -           type: ""
507,479 -       maxSurge: 0
510,481 - status:
511,481 -   currentNumberScheduled: 3
512,481 -   desiredNumberScheduled: 3
513,481 -   numberMisscheduled: 0
514,481 -   numberReady: 0
515,481 -   numberUnavailable: 3
516,481 -   observedGeneration: 1
517,481 -   updatedNumberScheduled: 3
Race condition with AWS EKS internal bootstrapping
Error: update daemonset/kube-proxy (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource daemonset/kube-proxy (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on daemonsets.apps "kube-proxy": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations:
  4,  3 -     deprecated.daemonset.template.generation: "1"
  5,  3 -     kubectl.kubernetes.io/last-applied-configuration: |
  6,  3 -       {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"eks.amazonaws.com/component":"kube-proxy","k8s-app":"kube-proxy"},"name":"kube-proxy","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"kube-proxy"}},"template":{"metadata":{"labels":{"k8s-app":"kube-proxy"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"kubernetes.io/os","operator":"In","values":["linux"]},{"key":"kubernetes.io/arch","operator":"In","values":["amd64","arm64"]},{"key":"eks.amazonaws.com/compute-type","operator":"NotIn","values":["fargate"]}]}]}}},"containers":[{"command":["kube-proxy","--v=2","--config=/var/lib/kube-proxy-config/config","--hostname-override=$(NODE_NAME)"],"env":[{"name":"NODE_NAME","valueFrom":{"fieldRef":{"fieldPath":"spec.nodeName"}}}],"image":"602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/kube-proxy:v1.24.7-minimal-eksbuild.2","name":"kube-proxy","resources":{"requests":{"cpu":"100m"}},"securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/var/log","name":"varlog","readOnly":false},{"mountPath":"/run/xtables.lock","name":"xtables-lock","readOnly":false},{"mountPath":"/lib/modules","name":"lib-modules","readOnly":true},{"mountPath":"/var/lib/kube-proxy/","name":"kubeconfig"},{"mountPath":"/var/lib/kube-proxy-config/","name":"config"}]}],"hostNetwork":true,"priorityClassName":"system-node-critical","serviceAccountName":"kube-proxy","tolerations":[{"operator":"Exists"}],"volumes":[{"hostPath":{"path":"/var/log"},"name":"varlog"},{"hostPath":{"path":"/run/xtables.lock","type":"FileOrCreate"},"name":"xtables-lock"},{"hostPath":{"path":"/lib/modules"},"name":"lib-modules"},{"configMap":{"name":"kube-proxy"},"name":"kubeconfig"},{"configMap":{"name":"kube-proxy-config"},"name":"config"}]}},"updateStrategy":{"rollingUpdate":{"maxUnavailable":"10%"},"type":"RollingUpdate"}}}
 12,  8 +     kapp.k14s.io/app: "1682434919650750393"
 12,  9 +     kapp.k14s.io/association: v1.5c5a114581f350e2b57df0ed7799471d
167,165 +       annotations:
167,166 +         nodetaint/crucial: "true"
170,170 +         kapp.k14s.io/app: "1682434919650750393"
170,171 +         kapp.k14s.io/association: v1.5c5a114581f350e2b57df0ed7799471d
194,196 -         - --hostname-override=$(NODE_NAME)
195,196 -         env:
196,196 -         - name: NODE_NAME
197,196 -           valueFrom:
198,196 -             fieldRef:
199,196 -               apiVersion: v1
200,196 -               fieldPath: spec.nodeName
201,196 -         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/kube-proxy:v1.24.7-minimal-eksbuild.2
202,196 +         image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/kube-proxy:v1.24.9-minimal-eksbuild.1
206,201 -             cpu: 100m
207,201 +             cpu: 50m
207,202 +             memory: 45Mi
260,256 - status:
261,256 -   currentNumberScheduled: 3
262,256 -   desiredNumberScheduled: 3
263,256 -   numberMisscheduled: 0
264,256 -   numberReady: 0
265,256 -   numberUnavailable: 3
266,256 -   observedGeneration: 1
267,256 -   updatedNumberScheduled: 3
Race condition - probably a bug?
Error: update poddisruptionbudget/kube-metrics-adapter (policy/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource poddisruptionbudget/kube-metrics-adapter (policy/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on poddisruptionbudgets.policy "kube-metrics-adapter": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}
 58, 57 - status:
 59, 57 -   conditions:
 60, 57 -   - lastTransitionTime: "2023-04-25T14:48:52Z"
 61, 57 -     message: ""
 62, 57 -     observedGeneration: 1
 63, 57 -     reason: InsufficientPods
 64, 57 -     status: "False"
 65, 57 -     type: DisruptionAllowed
 66, 57 -   currentHealthy: 0
 67, 57 -   desiredHealthy: 1
 68, 57 -   disruptionsAllowed: 0
 69, 57 -   expectedPods: 2
 70, 57 -   observedGeneration: 1
A controller update in the meantime
1:53:10PM: update role/iam-auth-default-rw (iam.services.k8s.aws/v1alpha1) namespace: ack-system
Error: update role/iam-auth-default-rw (iam.services.k8s.aws/v1alpha1) namespace: ack-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource role/iam-auth-default-rw (iam.services.k8s.aws/v1alpha1) namespace: ack-system: API server says: Operation cannot be fulfilled on roles.iam.services.k8s.aws "iam-auth-default-rw": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}
 77, 76 - status:
 78, 76 -   ackResourceMetadata:
 79, 76 -     ownerAccountID: "809228258731"
 80, 76 -     region: us-east-1
 81, 76 -   conditions:
 82, 76 -   - message: "AccessDenied: User: arn:aws:sts::xxxx:assumed-role/xx-eks-int-bvwhallns-NodeIamRole-T1TXSP3WQW7U/i-03578dab635b195e2
 83, 76 -       is not authorized to perform: iam:GetRole on resource: role o11n-eks-int-bvwhallns@default-rw
 84, 76 -       because no identity-based policy allows the iam:GetRole action\n\tstatus code:
 85, 76 -       403, request id: d0818e1c-ede2-40ca-b89e-e00951f93052"
 86, 76 -     status: "True"
 87, 76 -     type: ACK.Recoverable
 88, 76 -   - lastTransitionTime: "2023-04-25T13:53:09Z"
 89, 76 -     message: Unable to determine if desired resource state matches latest observed
 90, 76 -       state
 91, 76 -     reason: "AccessDenied: User: arn:aws:sts::xxx:assumed-role/xx-eks-int-bvwhallns-NodeIamRole-T1TXSP3WQW7U/i-03578dab635b195e2
 92, 76 -       is not authorized to perform: iam:GetRole on resource: role o11n-eks-int-bvwhallns@default-rw
 93, 76 -       because no identity-based policy allows the iam:GetRole action\n\tstatus code:
 94, 76 -       403, request id: d0818e1c-ede2-40ca-b89e-e00951f93052"
 95, 76 -     status: Unknown
 96, 76 -     type: ACK.ResourceSynced
Another status update
Error: update deployment/metrics-server (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource deployment/metrics-server (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on deployments.apps "metrics-server": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
166,166 -   progressDeadlineSeconds: 600
168,167 -   revisionHistoryLimit: 10
174,172 -       maxSurge: 25%
176,173 -     type: RollingUpdate
179,175 -       creationTimestamp: null
210,205 -           successThreshold: 1
211,205 -           timeoutSeconds: 1
225,218 -           successThreshold: 1
226,218 -           timeoutSeconds: 1
239,230 -         terminationMessagePath: /dev/termination-log
240,230 -         terminationMessagePolicy: File
244,233 -       dnsPolicy: ClusterFirst
248,236 -       restartPolicy: Always
249,236 -       schedulerName: default-scheduler
250,236 -       securityContext: {}
251,236 -       serviceAccount: metrics-server
253,237 -       terminationGracePeriodSeconds: 30
257,240 - status:
258,240 -   availableReplicas: 2
259,240 -   conditions:
260,240 -   - lastTransitionTime: "2023-04-25T13:49:12Z"
261,240 -     lastUpdateTime: "2023-04-25T13:49:12Z"
262,240 -     message: Deployment has minimum availability.
263,240 -     reason: MinimumReplicasAvailable
264,240 -     status: "True"
265,240 -     type: Available
266,240 -   - lastTransitionTime: "2023-04-25T13:48:22Z"
267,240 -     lastUpdateTime: "2023-04-25T13:49:15Z"
268,240 -     message: ReplicaSet "metrics-server-79bccb4d98" has successfully progressed.
269,240 -     reason: NewReplicaSetAvailable
270,240 -     status: "True"
271,240 -     type: Progressing
272,240 -   observedGeneration: 1
273,240 -   readyReplicas: 2
274,240 -   replicas: 2
275,240 -   updatedReplicas: 2
A status update bug?
Error: update role/iam-auth-monitoring-rw (iam.services.k8s.aws/v1alpha1) namespace: ack-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource role/iam-auth-monitoring-rw (iam.services.k8s.aws/v1alpha1) namespace: ack-system: API server says: Operation cannot be fulfilled on roles.iam.services.k8s.aws "iam-auth-monitoring-rw": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
  3,  3 -   annotations: {}
 77, 76 - status:
 78, 76 -   ackResourceMetadata:
 79, 76 -     ownerAccountID: "845194625280"
 80, 76 -     region: us-east-1
 81, 76 -   conditions:
 82, 76 -   - message: "AccessDenied: User: arn:aws:sts::xxx:assumed-role/o11n-eks-int-sam-NodeIamRole-1XHFCRCCZBNBX/i-0c6f929e0b30747ca
 83, 76 -       is not authorized to perform: iam:GetRole on resource: role o11n-eks-int-sam@monitoring-rw
 84, 76 -       because no identity-based policy allows the iam:GetRole action\n\tstatus code:
 85, 76 -       403, request id: 664701d0-648d-4d1f-981d-a98c91839776"
 86, 76 -     status: "True"
 87, 76 -     type: ACK.Recoverable
 88, 76 -   - lastTransitionTime: "2023-04-27T07:47:07Z"
 89, 76 -     message: Unable to determine if desired resource state matches latest observed
 90, 76 -       state
 91, 76 -     reason: "AccessDenied: User: arn:aws:sts::xxx:assumed-role/o11n-eks-int-sam-NodeIamRole-1XHFCRCCZBNBX/i-0c6f929e0b30747ca
 92, 76 -       is not authorized to perform: iam:GetRole on resource: role o11n-eks-int-sam@monitoring-rw
 93, 76 -       because no identity-based policy allows the iam:GetRole action\n\tstatus code:
 94, 76 -       403, request id: 664701d0-648d-4d1f-981d-xxx"
 95, 76 -     status: Unknown
 96, 76 -     type: ACK.ResourceSynced

What did you expect:

kapp to finish the rollout with the remaining resources. In our full-cluster desired state, the remaining resources can be critical, even fatal ones, plus hundreds of others that then fall out of relation, i.e. out of their deployment order.

IMHO those cases do not require that kapp stop its run immediately; it would be totally sufficient to finish the remaining deployments.

Ideas to solve, ordered by simplicity

  1. add a flag continueOnConflict that only logs the conflict instead of raising an error (sketched below the list)
  2. handle the error at the end of the run
  3. retry those failing resources later
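
To make option 1 concrete, this is roughly the behaviour we have in mind. The flag below is purely hypothetical and does not exist in kapp today; its name just follows the proposal above:

kapp deploy -a whole-cluster -f rendered-manifests/ --yes --continue-on-conflict
# hypothetical semantics: on a resource conflict, log the error, keep applying the
# remaining changes, optionally retry the conflicted resources at the end of the run
# (option 3), and only exit non-zero if conflicts remain after that (option 2)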

Anything else you would like to add:

We are deploying all cluster resources as one kapp app. We decided to go that route because we get kapp's "world view", we can easily define dependencies where necessary, and we have a desired state in the cluster that cleans up removed deployments.
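
For illustration, the dependency ordering mentioned above is what kapp's change-group / change-rule annotations give us; a minimal sketch with placeholder resource and group names:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-base-config            # placeholder resource
  namespace: kube-system
  annotations:
    kapp.k14s.io/change-group: "base"  # member of the ordering group "base"
data: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config                     # placeholder resource
  namespace: kube-system
  annotations:
    # only applied after everything in group "base" has been upserted
    kapp.k14s.io/change-rule: "upsert after upserting base"
data: {}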

Environment:

  • kapp version 0.55.0
  • Kubernetes version: AWS EKS 1.24

Vote on this request

This is an invitation to the community to vote on issues to help us prioritize our backlog. Use the "smiley face" at the top right of this comment to vote.

👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"

We are also happy to receive and review pull requests if you want to help work on this issue.

This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.

/remove stale

possibly related to #746

This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.