kubefirst / kubefirst

The Kubefirst Open Source Platform

Home Page: https://docs.kubefirst.io


argocd gets stuck during civo management cluster provisioning

johndietz opened this issue

in 2.4.0 through 2.4.2, we have encountered an intermittent issue where argocd gets stuck shortly after reaching the sync wave in which it puts itself under its own management.

when this issue presents itself, port forwarding to argocd shows deployments stuck in a progressing state with the status "Waiting for rollout to finish: observed deployment generation less than desired generation", and the rollout never completes.
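the symptom is also visible straight from kubectl, independent of the argocd UI. a minimal sketch, assuming the default argocd namespace and service name:

# port-forward the argocd api/ui (runs in the foreground)
kubectl -n argocd port-forward svc/argocd-server 8080:443

# compare each deployment's desired generation to the generation its controller
# has observed; a wedged rollout shows OBSERVED permanently lagging behind GEN
kubectl get deploy -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,GEN:.metadata.generation,OBSERVED:.status.observedGeneration'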

it's a bad circumstance to fall into, as there is no known remedy that will bring the cluster back cleanly under argocd's control other than a full management stack reprovision. we've tried orphaning resources, removing argocd and adding it back, among many other hammer-oriented techniques, without revitalizing a cluster that gets into this state.

anecdotally, this seems to be more prevalent on the civo + github stack than the other stacks. we will continue to treat this issue as a top priority and expedite its resolution.

if you have encountered this issue, please comment with your git and cloud provider details.

this may be related to the unresolved upstream issue argoproj/argo-cd#14266

some more details:

  1. when we install argocd to the management cluster, we kustomize against an upstream cloud installation (see the overlay sketch after this list):
    https://github.com/kubefirst/manifests/blob/main/argocd/cloud/kustomization.yaml

  2. there's a spot in our orchestration where we update argocd with the newly created oidc configurations. to do so, we apply a similar kustomize install via gitops, but bound to your own gitops repo with details specific to your domain, sso, etc. (also covered by the overlay sketch below):
    https://github.com/kubefirst/gitops-template/blob/main/civo-github/templates/mgmt/components/argocd/kustomization.yaml#L9-L11

  3. we then restart argocd-server after applying the update to argocd's config so that the new vault sso settings take effect (a restart sketch follows the overlay sketch below):
    https://github.com/kubefirst/gitops-template/blob/main/civo-github/templates/mgmt/components/argocd/argocd-oidc-restart-job.yaml#L58

  4. when argocd-server wakes up from the restart, it's mysteriously unable to reconcile or manage many of the deployment or statefulset resources that it's tracking.

  5. misc relevant details:

  • so far we've only observed this on resources that were created after the argocd restart sync wave
  • this is only occurring on civo cloud; we've audited our argocd registry setup and api sequencing, and found no inconsistencies with the other clouds
  • an earlier binary known to work on civo (2.3.8) is now also exhibiting this behavior
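to make steps 1 and 2 concrete, here is a minimal sketch of the shape of such an overlay; the base url, domain, and oidc values are placeholders of mine, and the real overlays live at the links above:

# write an illustrative kustomize overlay; everything below is placeholder
# shape, not the actual kubefirst content
cat > kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
  # hypothetical remote base standing in for the kubefirst/manifests cloud overlay
  - github.com/kubefirst/manifests/argocd/cloud?ref=main
patches:
  - patch: |-
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: argocd-cm
        namespace: argocd
      data:
        # argocd oidc settings; issuer and clientID stand in for the vault sso
        # details templated into your gitops repo
        oidc.config: |
          name: Vault
          issuer: https://vault.yourdomain.com/v1/identity/oidc/provider/argocd
          clientID: argocd
          clientSecret: $oidc.vault.clientSecret
EOF
kubectl apply -k .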
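and the restart in step 3 amounts to something like the following; the real implementation is the in-cluster job linked above, this sketch just assumes direct kubectl access to the management cluster:

# restart argocd-server so it reloads argocd-cm with the new oidc settings
kubectl -n argocd rollout restart deployment argocd-server
kubectl -n argocd rollout status deployment argocd-server --timeout=120s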

this has been fixed in 2.4.3 🚀

hey everyone - we've had some turbulence this week provisioning management clusters, with some intermittent reports of argocd locking up during provisioning. we've just released v2.4.3 to mitigate a circumstance that caused argocd to get stuck in a progressing state. should you run into a state where argocd is locked up, you can resolve it with the command:
kubectl -n argocd get deploy/argocd-server -oyaml | kubectl replace -f -

re-writing the object this way appears to nudge the deployment controller into re-observing it, which lets the stuck rollout status clear.
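if more workloads than argocd-server are wedged, the same trick can presumably be looped over a namespace (untested sketch; NS is whichever namespace holds the stuck resources, and stuck statefulsets could likely be handled analogously):

NS=argocd
for d in $(kubectl -n "$NS" get deploy -o name); do
  kubectl -n "$NS" get "$d" -o yaml | kubectl replace -f -
done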