fluxcd / helm-operator

Successor: https://github.com/fluxcd/helm-controller — The Flux Helm Operator, once upon a time a solution for declarative Helming.

Home Page:https://docs.fluxcd.io/projects/helm-operator/

chart is upgraded to out-of-sync git repo

amit-handda opened this issue · comments

Describe the bug

HelmRelease X is first upgraded using out-of-sync chart sources (git repo), and only about 10 minutes later is it correctly upgraded using in-sync chart sources.

To Reproduce

We were not able to reproduce it; it looks like a corner case. We dug into the code to see if we could spot something, but did not make much progress there either.

Timeline of events:
We have two branches, master and feature-branch. Both contain identical chart sources.
t0: X's helmrelease.spec.chart.ref is updated from feature-branch -> master
t0+2mins: X is upgraded to out-of-sync chart sources
t0+12mins: X is upgraded to in-sync chart sources

Steps to reproduce the behaviour:

  1. helm-operator setup
      - args:
        - --enabled-helm-versions=v3
        - --log-format=fmt
        - --git-timeout=20s
        - --git-poll-interval=5m
        - --charts-sync-interval=3m
        - --status-update-interval=30s
        - --update-chart-deps=true
        - --log-release-diffs=false
        - --workers=4
        - --tiller-namespace=kube-system
        image: docker.io/fluxcd/helm-operator:1.1.0
  2. Provide a HelmRelease example
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: logging-agent
  namespace: infra
spec:
  releaseName: logging
  chart:
    git: git@github.com:doordash/cluster-config
    path: charts/logging-pipeline
    ref: master

  values:
    agent:
      ...
  valuesFrom:
  - chartFileRef:
      path: overrides/profiles/agent.yaml
  3. Post the HelmRelease status, you can get this by running kubectl describe helmrelease <name>
❯ kd hr -n infra logging-agent
Name:         logging-agent
Namespace:    infra
Labels:       fluxcd.io/sync-gc-mark=sha256.qbJla6Lq9WUGzqt2bFgLmyDoKp2k1FalMGLmbIOChlQ
Annotations:  fluxcd.io/sync-checksum: 195945a0e6dcd70ff4f035cfcf639cb9e5040d32
API Version:  helm.fluxcd.io/v1
Kind:         HelmRelease
Metadata:
  Creation Timestamp:  2020-06-11T23:27:26Z
  Generation:          9
  Resource Version:    143889620
  Self Link:           /apis/helm.fluxcd.io/v1/namespaces/infra/helmreleases/logging-agent
  UID:                 a46ae65c-34b9-43b3-a3db-d88d6cb59d7c
Spec:
  Chart:
    Git:         git@github.com:org/repo
    Path:        charts/logging-pipeline
    Ref:         master
  Release Name:  logging
  Values:
    Agent:
        ....
Status:
  Conditions:
    Last Transition Time:   2020-10-27T17:45:15Z
    Last Update Time:       2020-10-29T15:46:32Z
    Message:                Chart fetch was successful for Helm release 'logging' in 'infra'.
    Reason:                 ChartFetched
    Status:                 True
    Type:                   ChartFetched
    Last Transition Time:   2020-10-29T15:46:44Z
    Last Update Time:       2020-11-05T16:56:17Z
    Message:                Release was successful for Helm release 'logging' in 'infra'.
    Reason:                 Succeeded
    Status:                 True
    Type:                   Released
  Last Attempted Revision:  bef6580bd535cb61c0c94b92d6f91f6fd1ddf87c
  Observed Generation:      9
  Phase:                    Succeeded
  Release Name:             logging
  Release Status:           deployed
  Revision:                 bef6580bd535cb61c0c94b92d6f91f6fd1ddf87c
Events:
  Type    Reason         Age                    From           Message
  ----    ------         ----                   ----           -------
  Normal  ReleaseSynced  6m42s (x165 over 22h)  helm-operator  managed release 'logging' in namespace 'infra' synchronized

Expected behavior

The Helm upgrade should not have used out-of-sync chart sources.

Logs

We don't have the logs for that timeline.

Additional context

  • Helm Operator version: 1.1.0
  • Kubernetes version: 1.17.9
  • Git provider: github
  • Helm repository provider: github git-repo

Hello folks,
@amit-handda's colleague here. I'd like to give a bit more context on this issue.

Our repo structure:

/charts/
-- logging-chart
-- monitoring-chart
-- etc...
/workloads/
-- prod-cluster-01/
---- helmRelease yaml manifests to install
-- dev-cluster-02
---- helmRelease yaml manifests to install
-- staging-cluster-02
---- helmRelease yaml manifests to install

In order to update a chart, we create a feature branch feat-232 off the master branch.
To test and progressively release the changes to dev, staging and lastly prod, we update the HelmRelease in the workloads folder:

...
  chart:
    git: git@github.com:<this-repo>.git
    path: charts/logging-chart
    ref: feat-232

This correctly installs the new chart based on feat-232 on the destination cluster.
Once we are ready for the last update, we merge feat-232, and in the same PR we roll the ref field back to master for all the clusters in the workloads folder.

When we merge this last PR, we notice that on some clusters (not all of them) a re-deploy happens when it shouldn't.
The last time, we noticed two Helm release versions created within a few minutes:

sh.helm.release.v1.logging.v26                                     helm.sh/release.v1                    1      17h
sh.helm.release.v1.logging.v27                                     helm.sh/release.v1                    1      16m
sh.helm.release.v1.logging.v28                                     helm.sh/release.v1                    1      8m3s

Looking at the diff between the releases, we see:
v26 applies the changes
v27 removes the changes
v28 applies the changes

We believe that there is a sync problem between the custom resource update event and the branch used to calculate the diff.
We would love your input on how this problem can manifest itself; we will be happy to contribute to getting this fixed.

Given we are in maintenance mode, it would be great to know if this issue is reproducible in the latest Helm Operator version (v1.2.0) before taking further action.

Hello @hiddeco, thank you for the reply.
We are in the process of multiple updates, and the Flux one is also in the pipeline.

That being said, updating at the moment doesn't guarantee that this "global" deploy problem goes away, and we would like to fix it in the short term.

We would love input from one of you folks on what it could be (on a first pass through the codebase we didn't find anything), and we will be happy to dig further and open a PR.

I can guess at what's happening here: the code mirroring the git repository does so asynchronously, and when you update the custom resource to switch branches, there's a race between the upgrade running and the mirror having fetched the merge commit.

The relevant code is around here: https://github.com/fluxcd/helm-operator/blob/master/pkg/chartsync/git.go. This matches the git repos used in HelmRelease resources with those currently being mirrored; when a mirror is missing, a new one is set up. But if a mirror already exists, it's assumed that it is usable.

I don't think this design accounts for the use you are putting it to. When you switch branches, nothing makes sure the mirror has the current head of the updated branch. (Why doesn't this cause a problem when you switch to the feature branch? If the ref is missing entirely, it error-loops until it's mirrored that ref.)

To work around it, you could update the git ref used in the HelmRelease to the merge commit, rather than master; and once that is released, you know it has that particular commit and it's safe to switch to master branch.
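
As a rough sketch of that workaround (not a tested manifest), the HelmRelease from the workloads folder could be pinned to the merge commit first. The metadata is borrowed from the example earlier in this issue, and `<merge-commit-sha>` is a placeholder for whatever commit the merge of feat-232 actually produced:

```yaml
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: logging-agent
  namespace: infra
spec:
  releaseName: logging
  chart:
    git: git@github.com:<this-repo>.git
    path: charts/logging-chart
    # Step 1: pin to the merge commit instead of the branch name.
    # If the mirror has not fetched this commit yet, the ref is missing
    # and the operator retries rather than releasing from a stale master.
    ref: <merge-commit-sha>
    # Step 2 (a later commit, once step 1 has been released):
    # ref: master
```

The cost is one extra commit per rollout, but the release of step 1 doubles as proof that the mirror already contains the merge commit before the ref points at master again.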

@squaremo thanks for the detailed explanation.
If our upgrade procedure isn’t well suited to the current state of the code, what would be a suggested upgrade strategy?
How do you do it at Weaveworks?
Thanks!

We don't use Helm charts in Weave Cloud -- it's all driven by updating images automatically, and occasionally adapting config by hand. We sync the main branch, and rely on mistakes showing up in dev before the same change is made for production. I would not hold this up as the ideal process though.

If you can manage the workaround (instead of reverting the chart ref, change it to a specific revision before proceeding), that's what I'd do. As a more future-proof alternative, consider porting your system to use the helm-controller (which is like a Helm operator v2).

@squaremo thank you for responding. Could you confirm whether this issue would go away if we port our system to the new helm-controller? Thanks.

Could you confirm whether this issue would go away if we port our system to the new helm-controller?

I expect that when you changed the branch back to master, the source-controller (which does git mirroring in v2) would simply clone the repository again, and thereby pick up the merge commit. The relevant code is in https://github.com/fluxcd/source-controller/blob/main/controllers/gitrepository_controller.go#L161

But don't rely on my head-computer -- I recommend making a throwaway environment to persuade yourselves it will work with your workflow. Getting all the bits running is easier than with helm-operator -- see https://toolkit.fluxcd.io/get-started/.
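
For such a throwaway environment, a minimal sketch of what the ported release might look like follows. Everything here is an assumption to adapt: the resource names `cluster-config` and `cluster-config-ssh` are hypothetical, the `apiVersion` values depend on which Flux release is installed, and the `valuesFrom.chartFileRef` override from the original HelmRelease is left out because it has no one-to-one equivalent in this sketch.

```yaml
# Git source, mirrored by source-controller
apiVersion: source.toolkit.fluxcd.io/v1   # adjust to the version your cluster serves
kind: GitRepository
metadata:
  name: cluster-config                    # hypothetical name
  namespace: infra
spec:
  interval: 5m
  url: ssh://git@github.com/<org>/<repo>
  ref:
    branch: master                        # switch to feat-232 while testing
  secretRef:
    name: cluster-config-ssh              # hypothetical Secret holding the SSH key
---
# Release, reconciled by helm-controller
apiVersion: helm.toolkit.fluxcd.io/v2     # adjust to the version your cluster serves
kind: HelmRelease
metadata:
  name: logging-agent
  namespace: infra
spec:
  interval: 5m
  releaseName: logging
  chart:
    spec:
      chart: ./charts/logging-pipeline    # path to the chart inside the git source
      sourceRef:
        kind: GitRepository
        name: cluster-config
```

In this layout the branch lives on the GitRepository rather than on the HelmRelease, so the feature-branch to master switch from this issue would be made there; whether that avoids the stale-source upgrade is exactly what the throwaway environment should confirm.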

@squaremo we have analyzed the code. Whenever we switch the HelmRelease's git repo branch (e.g. feature-branch -> master), there is always a possibility that the mirror will be out of sync. We have created a PR (the one referenced above) that detects the switch and avoids the incorrect Helm reconciliation by triggering the mirror sync upfront.
Please review it and let us know if it's acceptable. Thanks.