weaveworks / launcher

Weave Cloud Launcher

Helm chart install is incompatible

bboreham opened this issue · comments

The selector in the helm chart has two labels, whereas the one we serve for auto-update has one:

2019-10-24T14:34:41.838259571Z time="2019-10-24T14:34:41Z" level=info msg="Updating self from https://get.weave.works/k8s/agent.yaml?instanceID=<redacted>"
2019-10-24T14:34:41.856270392Z time="2019-10-24T14:34:41Z" level=info msg="Revision before self-update: 2"
2019-10-24T14:34:42.861367005Z time="2019-10-24T14:34:42Z" level=error msg="Failed to execute kubectl apply: The Deployment \"weave-agent\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"app\":\"weave-cloud\", \"release\":\"weave-cloud\", \"name\":\"weave-agent\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable\nFull output:\nnamespace/weave configured\nserviceaccount/weave-agent unchanged\nclusterrole.rbac.authorization.k8s.io/weave-agent configured\nclusterrolebinding.rbac.authorization.k8s.io/weave-agent configured\nThe Deployment \"weave-agent\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"app\":\"weave-cloud\", \"release\":\"weave-cloud\", \"name\":\"weave-agent\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable"
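
For illustration, here is a rough sketch (plain Go, no Kubernetes dependencies) of why the apply is rejected; the exact labels in the chart's selector are an assumption based on the description above and the values in the log:

    package main

    import "fmt"

    func main() {
        // Selector created by the helm chart (two labels; values taken from the
        // log above, exact chart template assumed):
        helmSelector := map[string]string{"app": "weave-cloud", "release": "weave-cloud"}

        // Selector in the agent.yaml served for auto-update (one label):
        servedSelector := map[string]string{"name": "weave-agent"}

        // kubectl apply merges map fields, so the selector it tries to write is
        // the union of the live (helm-created) labels and the served one:
        merged := map[string]string{}
        for k, v := range helmSelector {
            merged[k] = v
        }
        for k, v := range servedSelector {
            merged[k] = v
        }

        // spec.selector is immutable, so this three-label value is rejected with
        // the "field is immutable" error seen in the log.
        fmt.Println(merged) // map[app:weave-cloud name:weave-agent release:weave-cloud]
    }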

@bboreham here is the plan based on what we discussed the other day; let me know if you agree and I'll start implementing it.

We don't know how many agents were installed from the helm chart and are still stuck in this state (see #306). There is probably a way to find out by analysing the request log and cross-checking which agents hit agent.yaml but remain unconnected.

Aside from the question of how many agents are affected, a solution to recover them would involve the following.

We add a new query parameter to the agent.yaml URL; let's call it agent-generation. It would allow us to migrate agents from one generation to another in the future.
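
For example, the self-update URL from the log above would then look something like this (instanceID redacted, as in the log):

    https://get.weave.works/k8s/agent.yaml?instanceID=<redacted>&agent-generation=v1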

So for this case we should be able to apply the following migration plan:

  • release a new agent that uses agent-generation=v1
  • update the service to support an optional custom-selector=<labelSelector>
  • update the helm chart to use the new agent
  • wait some period of time (e.g. a day or two)
  • assume that all agents which hit the service without agent-generation set are broken
  • fix broken agents by redirecting agent.yaml to agent.yaml?agent-generation=v1&custom-selector=<helmLabels> (see the sketch after this list)
  • agent-generation and custom-selector are sticky, i.e. any request with these params carries them forward as part of the URL used for subsequent requests
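
Here is a rough sketch of the service-side part of this plan, assuming a plain net/http handler; the handler name is hypothetical, the helmLabels value is left as a placeholder, and the default name-only selector is inferred from the log above - this is not the real get.weave.works code:

    package main

    import (
        "net/http"
        "net/url"
    )

    // Placeholder for the helm chart's label selector, as in the plan above.
    const helmLabels = "<helmLabels>"

    func handleAgentYAML(w http.ResponseWriter, r *http.Request) {
        q := r.URL.Query()

        // After the waiting period, requests without agent-generation are assumed
        // to come from broken, helm-installed agents: redirect them to a URL that
        // pins the generation and carries the chart's selector.
        if q.Get("agent-generation") == "" {
            q.Set("agent-generation", "v1")
            q.Set("custom-selector", helmLabels)
            u := url.URL{Path: r.URL.Path, RawQuery: q.Encode()}
            http.Redirect(w, r, u.String(), http.StatusTemporaryRedirect)
            return
        }

        // Otherwise render agent.yaml, using custom-selector (if present) instead
        // of the default name-only selector, and embed the full query string in the
        // manifest's update URL so the agent carries both parameters forward on its
        // next request - this is what makes them "sticky".
        selector := q.Get("custom-selector")
        if selector == "" {
            selector = "name=weave-agent"
        }
        _ = selector // manifest rendering elided
    }

    func main() {
        http.HandleFunc("/k8s/agent.yaml", handleAgentYAML)
        http.ListenAndServe(":8080", nil)
    }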

This plan could work in theory, but there is a risk of further breakage in cases like the following:

  • the agent was installed without helm, but wasn't running at the time we executed the migration plan, e.g.:

    • the cluster had zero nodes the agent could run on and only gets scaled up later
    • the agent was scaled to zero replicas and scaled up later
  • the agent configuration is maintained by the user, e.g.:

    • they re-create their clusters and run kubectl apply -f agent.yaml (where agent.yaml is a copy of the config that they downloaded at some point)
    • they forked the helm chart, and possibly made further changes to it
  • the user didn't use the same helm install command, so the label values could differ

I suppose all of these cases could be considered unorthodox usage that we cannot support, but I'm not sure.

Aside from this, I am not quite sure the helm chart fix would stick - the helm community appears to mandate the use of app and release labels, which is what our chart currently does.

Thinking about this again, a Weave Cloud helm chart should install just the launcher, which can then install everything else like it normally does.

This would parallel, for instance, https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack, which installs prometheus-operator, which in turn installs Prometheus and some exporters.
The Weave Cloud Launcher functions as an "operator".
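
For reference, the launcher already behaves like an operator in that sense; the log above shows it fetching the hosted manifest and running kubectl apply, roughly as in this simplified sketch (not the actual launcher code):

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "os"
        "os/exec"
    )

    // Fetch the hosted manifest and pipe it into kubectl apply, which is roughly
    // what the self-update in the log above does.
    func selfUpdate(updateURL string) error {
        resp, err := http.Get(updateURL)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        cmd := exec.Command("kubectl", "apply", "-f", "-")
        cmd.Stdin = resp.Body
        out, err := cmd.CombinedOutput()
        if err != nil {
            return fmt.Errorf("failed to execute kubectl apply: %s: %v", out, err)
        }
        return nil
    }

    func main() {
        // Usage (hypothetical): self-update <agent.yaml URL>
        if err := selfUpdate(os.Args[1]); err != nil {
            log.Fatal(err)
        }
    }

A chart that installs only the launcher would then never create the agent Deployment itself, so a selector mismatch like the one above could not be introduced by the chart.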