kubefirst / kubefirst

The Kubefirst Open Source Platform

Home Page: https://docs.kubefirst.io


argocd gets stuck during civo management cluster provisioning

johndietz opened this issue

in 2.4.0 through 2.4.2, we have encountered an intermittent issue where argocd gets stuck shortly after reaching the sync wave in which it puts itself under its own management.

when this issue presents itself, port forwarding to argocd shows deployments stuck in a progressing state with the status "Waiting for rollout to finish: observed deployment generation less than desired generation", and the rollout never completes.
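the symptom is also visible straight from kubectl, independent of the argocd UI. a minimal sketch, assuming the default argocd namespace and service name:

# port-forward the argocd api/ui (runs in the foreground)
kubectl -n argocd port-forward svc/argocd-server 8080:443

# compare each deployment's desired generation to the generation its controller
# has observed; a wedged rollout shows OBSERVED permanently lagging behind GEN
kubectl get deploy -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,GEN:.metadata.generation,OBSERVED:.status.observedGeneration'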

it's a bad circumstance to fall into, as there is no known remedy that will bring the cluster back cleanly under argocd's control other than a full management stack reprovision. we've tried orphaning resources, removing argocd and adding it back, among many other hammer-oriented techniques, without revitalizing a cluster that gets into this state.

anecdotally, this seems to be more prevalent on the civo + github stack than the other stacks. we will continue to treat this issue as a top priority and expedite its resolution.

if you have encountered this issue, please comment with your git and cloud provider details.

this may be related to the unresolved upstream issue argoproj/argo-cd#14266

some more details:

  1. when we install argocd to the management cluster, we kustomize against an upstream cloud installation (see the overlay sketch after this list):
    https://github.com/kubefirst/manifests/blob/main/argocd/cloud/kustomization.yaml

  2. there's a spot in our orchestration where we update argocd with the newly created oidc configurations. to do so, we apply a similar kustomize install via gitops, but bound to your own gitops repo with details specific to your domain, sso, etc. (also covered by the overlay sketch below):
    https://github.com/kubefirst/gitops-template/blob/main/civo-github/templates/mgmt/components/argocd/kustomization.yaml#L9-L11

  3. we then restart argocd-server after applying the update to argocd's config so that the new vault sso settings take effect (a restart sketch follows the overlay sketch below):
    https://github.com/kubefirst/gitops-template/blob/main/civo-github/templates/mgmt/components/argocd/argocd-oidc-restart-job.yaml#L58

  4. when argocd-server wakes up from the restart, it's mysteriously unable to reconcile or manage many of the deployment or statefulset resources that it's tracking.

  5. misc relevant details:

  • so far we've only observed this on resources that were created after the argocd restart sync wave
  • this is only occurring on civo cloud; we've audited our argocd registry setup and api sequencing, and found no inconsistencies with the other clouds
  • an earlier binary known to work on civo (2.3.8) is now also exhibiting this behavior
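to make steps 1 and 2 concrete, here is a minimal sketch of the shape of such an overlay; the base url, domain, and oidc values are placeholders of mine, and the real overlays live at the links above:

# write an illustrative kustomize overlay; everything below is placeholder
# shape, not the actual kubefirst content
cat > kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
  # hypothetical remote base standing in for the kubefirst/manifests cloud overlay
  - github.com/kubefirst/manifests/argocd/cloud?ref=main
patches:
  - patch: |-
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: argocd-cm
        namespace: argocd
      data:
        # argocd oidc settings; issuer and clientID stand in for the vault sso
        # details templated into your gitops repo
        oidc.config: |
          name: Vault
          issuer: https://vault.yourdomain.com/v1/identity/oidc/provider/argocd
          clientID: argocd
          clientSecret: $oidc.vault.clientSecret
EOF
kubectl apply -k .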
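and the restart in step 3 amounts to something like the following; the real implementation is the in-cluster job linked above, this sketch just assumes direct kubectl access to the management cluster:

# restart argocd-server so it reloads argocd-cm with the new oidc settings
kubectl -n argocd rollout restart deployment argocd-server
kubectl -n argocd rollout status deployment argocd-server --timeout=120s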

this has been fixed in 2.4.3 🚀

hey everyone - we've had some turbulence this week provisioning management clusters, with some intermittent reports of argocd locking up during provisioning. we've just released v2.4.3 to mitigate a circumstance that caused argocd to get stuck in a progressing state. should you run into a state where argocd is locked up, you can resolve it with the command:
kubectl -n argocd get deploy/argocd-server -oyaml | kubectl replace -f -

re-writing the object this way appears to nudge the deployment controller into re-observing it, which lets the stuck rollout status clear.
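if more workloads than argocd-server are wedged, the same trick can presumably be looped over a namespace (untested sketch; NS is whichever namespace holds the stuck resources, and stuck statefulsets could likely be handled analogously):

NS=argocd
for d in $(kubectl -n "$NS" get deploy -o name); do
  kubectl -n "$NS" get "$d" -o yaml | kubectl replace -f -
done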