chart is upgraded to out-of-sync git repo
amit-handda opened this issue · comments
Describe the bug
HelmRelease X is first upgraded to out-of-sync chart sources (git repo) before being correctly upgraded to in-sync chart sources about 10 minutes later.
To Reproduce
We were not able to reproduce it; it looks like a corner case. We dug into the code to see if we could spot something, but did not make much progress there either.
Timeline of events:
We have two branches, master and feature-branch. Both contain identical chart sources.
t0: X's helmrelease.spec.chart.ref is updated from feature-branch -> master
t0+2mins: X is upgraded to out-of-sync chart sources
t0+12mins: X is upgraded to in-sync chart sources
Steps to reproduce the behaviour:
- helm-operator setup
- args:
  - --enabled-helm-versions=v3
  - --log-format=fmt
  - --git-timeout=20s
  - --git-poll-interval=5m
  - --charts-sync-interval=3m
  - --status-update-interval=30s
  - --update-chart-deps=true
  - --log-release-diffs=false
  - --workers=4
  - --tiller-namespace=kube-system
  image: docker.io/fluxcd/helm-operator:1.1.0
- Provide a HelmRelease example
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: logging-agent
  namespace: infra
spec:
  releaseName: logging
  chart:
    git: git@github.com:doordash/cluster-config
    path: charts/logging-pipeline
    ref: master
  values:
    agent:
      ...
  valuesFrom:
  - chartFileRef:
      path: overrides/profiles/agent.yaml
- Post the HelmRelease status, you can get this by running kubectl describe helmrelease <name>
❯ kd hr -n infra logging-agent
Name:         logging-agent
Namespace:    infra
Labels:       fluxcd.io/sync-gc-mark=sha256.qbJla6Lq9WUGzqt2bFgLmyDoKp2k1FalMGLmbIOChlQ
Annotations:  fluxcd.io/sync-checksum: 195945a0e6dcd70ff4f035cfcf639cb9e5040d32
API Version:  helm.fluxcd.io/v1
Kind:         HelmRelease
Metadata:
  Creation Timestamp:  2020-06-11T23:27:26Z
  Generation:          9
  Resource Version:    143889620
  Self Link:           /apis/helm.fluxcd.io/v1/namespaces/infra/helmreleases/logging-agent
  UID:                 a46ae65c-34b9-43b3-a3db-d88d6cb59d7c
Spec:
  Chart:
    Git:   git@github.com:org/repo
    Path:  charts/logging-pipeline
    Ref:   master
  Release Name:  logging
  Values:
    Agent:
      ....
Status:
  Conditions:
    Last Transition Time:  2020-10-27T17:45:15Z
    Last Update Time:      2020-10-29T15:46:32Z
    Message:               Chart fetch was successful for Helm release 'logging' in 'infra'.
    Reason:                ChartFetched
    Status:                True
    Type:                  ChartFetched
    Last Transition Time:  2020-10-29T15:46:44Z
    Last Update Time:      2020-11-05T16:56:17Z
    Message:               Release was successful for Helm release 'logging' in 'infra'.
    Reason:                Succeeded
    Status:                True
    Type:                  Released
  Last Attempted Revision:  bef6580bd535cb61c0c94b92d6f91f6fd1ddf87c
  Observed Generation:      9
  Phase:                    Succeeded
  Release Name:             logging
  Release Status:           deployed
  Revision:                 bef6580bd535cb61c0c94b92d6f91f6fd1ddf87c
Events:
  Type    Reason         Age                    From           Message
  ----    ------         ----                   ----           -------
  Normal  ReleaseSynced  6m42s (x165 over 22h)  helm-operator  managed release 'logging' in namespace 'infra' synchronized
Expected behavior
The Helm upgrade should not have used out-of-sync chart sources.
Logs
We don't have the logs for that timeline.
Additional context
- Helm Operator version: 1.1.0
- Kubernetes version: 1.17.9
- Git provider: github
- Helm repository provider: github git-repo
Hello folks,
@amit-handda's colleague here. I'd like to give a little more context on this issue.
our repo structure:
/charts/
-- logging-chart
-- monitoring-chart
-- etc...
/workloads/
-- prod-cluster-01/
---- helmRelease yaml manifests to install
-- dev-cluster-02
---- helmRelease yaml manifests to install
-- staging-cluster-02
---- helmRelease yaml manifests to install
In order to update a chart we create a feature branch feat-232 off the master branch.
To test and progressively release the changes to dev, staging and lastly prod, we update the helmRelease in the workloads folder:
...
chart:
  git: git@github.com:<this-repo>.git
  path: charts/logging-chart
  ref: feat-232
This correctly installs the new chart based on feat-232 on the destination cluster.
Once we are ready for the last update, we merge feat-232 and, in the same PR, roll the ref field back to master for all the clusters in the workloads folder.
When we merge this last PR, we notice that on some clusters (not all of them) a re-deploy happens when it shouldn't.
Last time we noticed two Helm release versions created within a few minutes:
sh.helm.release.v1.logging.v26 helm.sh/release.v1 1 17h
sh.helm.release.v1.logging.v27 helm.sh/release.v1 1 16m
sh.helm.release.v1.logging.v28 helm.sh/release.v1 1 8m3s
looking at the diff between the releases we see:
v26 applies the changes
v27 removes the changes
v28 applies the changes
We believe that there is a sync problem between the HelmRelease custom resource update event and the branch state used to calculate the diff.
Would love your inputs on how this problem can manifest itself, we will be happy to contribute in getting this fixed.
Given we are in maintenance mode, it would be great to know if this issue is reproducible in the latest Helm Operator version (v1.2.0) before taking further action.
hello @hiddeco, thank you for the reply.
We are in the process of several updates, and the Flux one is also in the pipeline.
That being said, updating at the moment doesn't guarantee that this "global" deploy problem goes away, and we would like to fix the problem in the short term.
We would love input from one of you folks on what it could be (on a first pass through the codebase we didn't find anything), and we will be happy to dig further and open a PR.
I can guess at what's happening here: the code mirroring the git repository does so asynchronously, and when you update the custom resource to switch branches, there's a race between the upgrade running and the mirror having fetched the merge commit.
The relevant code is around here: https://github.com/fluxcd/helm-operator/blob/master/pkg/chartsync/git.go. This matches the git repos used in HelmRelease resources with those currently being mirrored; when there's a lack, a new mirror is set up. But if a mirror already exists, it's assumed that it is usable.
I don't think this design accounts for the use you are putting it to. When you switch branches, there is nothing that makes sure it has the current head of the updated branch. (Why doesn't this cause a problem when you switch to the feature branch? If the ref is missing entirely, it error-loops until it's mirrored that ref.)
To work around it, you could update the git ref used in the HelmRelease to the merge commit, rather than master; and once that is released, you know it has that particular commit and it's safe to switch to master branch.
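The race and the workaround described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not helm-operator's actual code: `Upstream`, `Mirror`, and the commit names `c1`, `c2`, and `m3` are hypothetical, and the mirror is reduced to a dictionary of refs that is refreshed only when `fetch()` is called.

```python
class Upstream:
    """Hypothetical stand-in for the remote git repository."""

    def __init__(self, refs):
        self.refs = dict(refs)               # branch name -> head commit
        self.commits = set(refs.values())    # every commit the remote knows

    def push(self, branch, commit):
        """Move a branch head, e.g. when a PR is merged."""
        self.refs[branch] = commit
        self.commits.add(commit)


class Mirror:
    """Hypothetical local mirror that lags behind the remote between fetches."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.refs = {}
        self.commits = set()

    def fetch(self):
        """Copy the remote state; in this model it is called explicitly."""
        self.refs = dict(self.upstream.refs)
        self.commits = set(self.upstream.commits)

    def resolve(self, ref):
        """Resolve a branch name or commit using only what has been mirrored."""
        if ref in self.refs:
            return self.refs[ref]            # may be a stale branch head
        if ref in self.commits:
            return ref                       # pinned commit is present
        raise LookupError(f"ref {ref!r} not mirrored yet, retry later")


upstream = Upstream({"master": "c1", "feat-232": "c2"})
mirror = Mirror(upstream)
mirror.fetch()

# The PR is merged upstream, but the mirror has not fetched again yet.
upstream.push("master", "m3")

stale = mirror.resolve("master")             # existing mirror is assumed usable
try:
    mirror.resolve("m3")                     # workaround: pin the merge commit
    pinned_before_fetch = "resolved"
except LookupError:
    pinned_before_fetch = "retry"            # missing ref: error, no upgrade yet

mirror.fetch()                               # the next sync catches up
pinned_after_fetch = mirror.resolve("m3")
fresh = mirror.resolve("master")

print(stale, pinned_before_fetch, pinned_after_fetch, fresh)
```

In this model, resolving `master` right after the merge yields the old head, while resolving the pinned merge commit fails until the mirror has fetched it. That is why pinning the commit first and switching to `master` afterwards avoids the stale upgrade.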
@squaremo thanks for the detailed explanation.
If our upgrade procedure isn't well suited to the current state of the code, what would be a suggested upgrade strategy?
How do you do it at Weaveworks?
Thanks!
We don't use Helm charts in Weave Cloud -- it's all driven by updating images automatically, and occasionally adapting config by hand. We sync the main branch, and rely on mistakes showing up in dev before the same change is made for production. I would not hold this up as the ideal process though.
If you can manage the workaround (instead of reverting the chart ref, change it to a specific revision before proceeding), that's what I'd do. As a more future-proof alternative, consider porting your system to use the helm-controller (which is like a Helm operator v2).
@squaremo thank you for responding. Could you confirm that this issue won't occur if we port our system to the new helm-controller? Thanks.
Could you confirm that this issue won't occur if we port our system to the new helm-controller?
I expect that when you changed the branch back to master, the source-controller (which does git mirroring in v2) would simply clone the repository again, and thereby pick up the merge commit. The relevant code is in https://github.com/fluxcd/source-controller/blob/main/controllers/gitrepository_controller.go#L161
But don't rely on my head-computer -- I recommend making a throwaway environment to persuade yourselves it will work with your workflow. Getting all the bits running is easier than with helm-operator -- see https://toolkit.fluxcd.io/get-started/.
@squaremo we have analyzed the code. Whenever we switch the HelmRelease's git repo branch (e.g. feature-branch -> master), there will always be a possibility that the mirror is out of sync. We have created a PR that detects the switch and avoids the incorrect Helm reconciliation by triggering the mirror sync upfront (in the above PR).
Please review it and let us know if it's acceptable. Thanks.
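The idea described above, detecting a ref switch and syncing the mirror before reconciling, could look roughly like the following. This is a hypothetical sketch rather than the code in the PR: `revision_for_upgrade`, `fetch_upstream`, and the commit names are invented for illustration, and the mirror is modelled as a plain dictionary.

```python
def revision_for_upgrade(mirror, fetch_upstream, previous_ref, requested_ref):
    """Return the commit to use for an upgrade.

    mirror:         dict of ref -> commit, the locally mirrored state
    fetch_upstream: callable returning the remote's current ref -> commit dict
    """
    if requested_ref != previous_ref:
        # The HelmRelease switched refs: refresh the mirror before resolving,
        # instead of trusting whatever the last periodic sync left behind.
        mirror.clear()
        mirror.update(fetch_upstream())
    return mirror[requested_ref]


fetches = []

def fetch_upstream():
    fetches.append(1)                        # count how often the remote is hit
    return {"master": "m3", "feat-232": "c2"}

mirror = {"master": "c1", "feat-232": "c2"}  # master is stale in the mirror

switched = revision_for_upgrade(mirror, fetch_upstream, "feat-232", "master")
unchanged = revision_for_upgrade(mirror, fetch_upstream, "master", "master")
print(switched, unchanged, len(fetches))
```

Only the call where the ref changes triggers a fetch; the second call, with an unchanged ref, reuses the mirrored state and leaves refreshing to the regular sync.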