fluxcd / flux

Successor: https://github.com/fluxcd/flux2

Home Page: https://fluxcd.io


Deletion of resources not in repo

errordeveloper opened this issue

Currently, we don't provide deletion. I'm sure there was a discussion of how we could do it earlier, but I've not found a dedicated issue so I thought we should have one.

How would you want it to work?

How would you want it to work?

@squaremo I think it'd be fair to say that everything we create based on config that we see in git should be deleted when it no longer exists in git.

One of the discussion points could be kubectl apply --prune; however, this alone doesn't help very much, as it revolves around a selector and an exclude list. I've not tried it, and it's still alpha. Not sure it is of any use to us, really.

A very naive solution would be to track the things we create and delete them when they no longer appear in the current set of config. However, that may be tricky... I'd consider looking at the diff between now and the previous sync, seeing whether it includes file removals, and if so, trying to remove those.

@rndstr suggested that we could use annotations (which makes sense in any case), and that we could also consider not deleting anything automatically, and instead asking the user whether there is stuff they'd want to delete.

I'd consider looking at the diff between now and the previous sync, seeing whether it includes file removals, and if so, trying to remove those.

That approach probably won't tolerate aggressive history rewrites... but it'd probably be okay to ask for the user's attention when we are uncertain about whether to delete something or not.

User feedback: https://weave-community.slack.com/archives/C4U5ATZ9S/p1508346257000335

wrt deleting things not in the repo: I would likely not enable this even if it were available. we deploy things that we don’t actively manage like default/kubernetes

For our use case, we were investigating the usage of Flux as a kind of syncer between a cluster and a monorepo containing a series of resource manifests. Ideally, Flux would ensure that the state of the cluster matches the state of the monorepo (any new resources in the monorepo --> get created in the cluster, any resources removed from the monorepo --> get deleted from the cluster).

A sketch of a design:

  • you can switch automatic deletion on with an argument to the daemon
  • the daemon has an argument giving a deletion grace period, which defaults to, say, 5m
  • if a resource in the cluster is not present in the repo, it is annotated with a deletion deadline of now + the grace period
  • if a resource in the cluster is not present in the repo and has a deletion deadline in the past, it is deleted.

Refinements:

  • you can also annotate a resource with a grace period, which is used instead of the daemon argument
  • you can exclude specific resources or namespaces from deletion
  • automatic deletion and the grace period are a property of the repo configuration, when we support multiple sync repos
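
To make the sketch concrete: a resource that has disappeared from the repo might end up annotated like this (the annotation keys here are made up, purely for illustration):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical workload
  namespace: default
  annotations:
    # hypothetical key: recorded by the daemon when the resource vanishes from git
    flux.weave.works/deletion-deadline: "2018-01-15T12:05:00Z"
    # hypothetical key: per-resource override of the daemon's default grace period
    flux.weave.works/deletion-grace-period: "10m"

If the resource is still absent from the repo once the deadline has passed, the daemon would delete it.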

Great to see this being discussed - this is a critical feature for us as we consider deprecating our previous jenkins/landscaper/helm based deployer in favour of flux or an in-house equivalent.

I like the approach suggested, but I do wonder if it would be appropriate to use in the kube-system namespace (we are moving towards a deployer per customer and would ideally like to treat deployments to kube-system in a similar fashion). The issue is that there is so much present in kube-system after cluster bootstrap (e.g. kops addons) that we'd need flexible ways to exclude resources; ideally a list of label selectors.

Using a list of key:value labels would also allow us to exclude the resources installed via helm, by adding heritage=Tiller to that list. Even if we move to flux, we'd like to keep supporting charts (possibly via HelmCRD).

Would you be in a position to give any indication of how high a priority this is for you? Ideally a timeframe, if possible, although that may be too much to ask.

I like the approach suggested, but I do wonder if it would be appropriate to use in the kube-system namespace (we are moving towards a deployer per customer and would ideally like to treat deployments to kube-system in a similar fashion). The issue is that there is so much present in kube-system after cluster bootstrap (e.g. kops addons) that we'd need flexible ways to exclude resources; ideally a list of label selectors.

If I understand, you would like to be able to immunise some resources from deletion, where the resources are not necessarily known ahead of time? Yeah, selectors is a good solution for that. We'd have to figure out where the selector goes. If the policy is reasonably slow-changing, maybe just in a configmap (it'd be good to move other stuff into a config map too).

Re priorities: this isn't on the radar for weaveworks right at this minute, but may become more important when we look at supporting multiple backing repos.

Thanks for the update. And yes, we don't know the resource names themselves ahead of time, and there are a lot of them. After the cluster is bootstrapped, kube-system already contains a large selection from here: https://github.com/kubernetes/kops/tree/master/addons but it would depend on the cluster config. The selector list is also much more succinct than the resource list.

The selector list is also much more succinct than the resource list.

Got it, that makes sense. What do you think about this being an item of configuration, i.e., set in a configmap? It would be composed with the per-resource annotations, and look like (off the top of my head):

deletion:
  defaultGracePeriod: 5m
  exclusions:
  - ruleName: dbshell # used in log messages at least
    namespace: default
    selector:
      name: dbshell
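
For the exclusion rule above to match, the resource would carry the corresponding label, e.g. (a sketch, assuming a plain label selector):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dbshell
  namespace: default
  labels:
    name: dbshell   # matched by the selector in the dbshell exclusion rule above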

Yes, I like both the use of a configmap and the format. It feels generally useful and flexible.

Agree with @cptbigglesworth, this is a critical feature. It would give us an incentive to move away from a push based CI/CD system.

Is anyone working on this, @squaremo ?

Ouch, I was getting really excited for Flux as a replacement for our in house git ops solution, but without deletion it would be a large step back.

How about annotations on resources that can be automatically deleted if an equivalent can't be found in the manifests? This would be a deletion whitelist approach. Anything carrying the --auto-delete-annotation <value> could be deleted from Kubernetes when it is no longer in the manifests repo.

A deletion blacklist approach is tricky because it relies on annotating resources not in the manifest repo, the very things we have the least control/knowledge of.
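
A sketch of what a whitelisted resource might look like (the annotation key is hypothetical, named after the proposed --auto-delete-annotation argument):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical workload
  namespace: default
  annotations:
    # hypothetical opt-in marker: only resources carrying this annotation
    # may be garbage-collected once they disappear from the manifests repo
    flux.weave.works/auto-delete: "true"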

If automating the deletion of deployments requires too much work, then one option that I think could be implemented without changing much in the current model is to update the "list controllers" response by adding a field containing which file the controller originates from. The team that I'm working with would be able to use this information to manually handle any removal on our side by:

  1. Selecting a controller
  2. Fetching file managing the controller
  3. Removing the file from repository
  4. Calling "sync" to manually synchronize flux with the git repository

Then we can also do something like this (on our side):

  1. Search for all Kubernetes objects with the same name, in the same namespace, as those defined in the file
  2. Delete all Kubernetes objects found in 1

This would make it possible to somewhat "automate" the removal of Kubernetes objects managed by flux; the only thing that's missing right now is the knowledge of which controllers come from which files.
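
As a sketch (the file field and its name are hypothetical), the extended "list controllers" response might look like:

controllers:
- id: default:deployment/authfe
  file: services/authfe/deployment.yaml   # hypothetical new field: originating manifest
- id: default:deployment/users
  file: services/users/deployment.yaml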

A sketch of a design

That is quite nice, but IMO is missing one desirable property: flux should never delete something it didn't create or previously assert the existence of.

I am also concerned that a flux misconfiguration - say, accidentally pointing flux at an empty git repo - would wipe out the cluster.

An alternative approach is to instruct flux to delete workloads via special 'commands' in the git commit, similar to the commands github has for closing PRs. e.g. 'deletes weave/deployment:prometheus'. Then, if flux cannot find the workload in git, and it can find such a commit message after the workload was last in git, then flux would delete the workload.

Another approach could be:

  1. annotate the resources with flux.weave.works/delete: "true"
  2. commit changes
  3. Flux checks if the resources marked for deletion are present on the cluster and removes them
  4. Flux deletes the annotated files and commits the changes to git
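
In manifest form, step 1 would amount to adding the annotation and committing, e.g. (a sketch; the workload name is made up):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical workload
  namespace: default
  annotations:
    flux.weave.works/delete: "true"   # step 1: mark the resource for removal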

Step 4 would have to handle cases where a single file contains definitions for multiple workloads.

@stefanprodan

Flux deletes the annotated files and commits the changes to git

This approach is nice because it's simple to understand.


Another alternative would be to maintain a list of resources flux has "seen". A resource is "seen" if flux has had some kind of control of it, i.e. it has appeared in a k8s config that flux is syncing. The list would be committed to git.

If flux notices a resource in this "seen" list but with no corresponding config, it would delete the resource from the "seen" list and the cluster.

This would prevent deletion of resources which flux is not controlling, for example kube-system addons, if they are not in the config flux is configured to look at.
Pointing flux at an empty repo will also have no effect, because the "seen" list would be empty.

Example seen list:

resources_seen:
  - deployment:default/authfe
  - deployment:default/users
  - daemonset:default/other

Flux deletes the annotated files and commits the changes to git

This approach is nice because it's simple to understand.

I don't think it is simple to understand, because it drives the changes to git from the cluster, rather than the other way around, which is supposed to be the case.

There are also mechanical problems: do you remove the resource from the cluster first, then the file (or piece of file) from git, or the other way around -- and what happens in either of those cases if you fail in between the two steps?

Actually, I've misunderstood Stefan's suggestion I think: the idea in step 1 is that you put an annotation in the manifest file. In that case, my objections are weaker. If you fail to delete the resource in the cluster, you will still have the annotation in git, and will try again next time.

Another (weakish) objection is then that deleting something becomes somewhat indirect -- and I still don't like the inverted control, though mechanically it's workable.

[The sketch above] is quite nice, but IMO is missing one desirable property: flux should never delete something it didn't create or previously assert the existence of.
I am also concerned that a flux misconfiguration - say, accidentally pointing flux at an empty git repo - would wipe out the cluster.

I'm inclined to agree with this principle and the accompanying objection, especially since it would allay many of the concerns expressed above; e.g., it would not remove things from kube-system (unless they had been in git after all).

An alternative approach is to instruct flux to delete workloads via special 'commands' in the git commit, similar to the commands github has for closing PRs. e.g. 'deletes weave/deployment:prometheus'. Then, if flux cannot find the workload in git, and it can find such a commit message after the workload was last in git, then flux would delete the workload.

This has the property (I think?) that you can always tell what to do, given the git repo and the cluster, even if you have never seen either before.

A downside is that it is yet another way of doing things :-( And, of course, prone to typos, forgetfulness, and so on. In the absence of a robust implicit mechanism, it might be a goer, though.

This has the property (I think?) that you can always tell what to do, given the git repo and the cluster, even if you have never seen either before.

Yes, though perhaps not quite the way I intended. Consider the case where a cluster and git repo meet for the first time, and the git repo used to contain some workload definition and has a subsequent delete commit satisfying the criteria I mentioned. Would we really want that workload to be deleted? As far as we know, flux never created it or asserted its existence, thus violating the desirable principle I proposed.

A downside is that it is yet another way of doing things :-( And, of course, prone to typos, forgetfulness, and so on.

Yes. Cue empty commits with "I forgot to tell flux to delete x/y:z" :(

The approach we took was based on these requirements:

  1. The applier should only garbage collect objects it created
  2. There can be multiple appliers per cluster

To achieve this, the applier (read: flux or kube-applier) should label objects that it creates, and we configure this using a list of key:value labels, conventionally a single pair with applier as the key:

    runner:
      labels:
        applier: "a-unique-name-for-this-instance-of-the-applier"

We can then choose to enable garbage collection and limit the collection to objects which have this label:

    garbage_collector:
      object_selector:
        applier: "a-unique-name-for-this-instance-of-the-applier"

As an optimization I believe git diff is used to determine if files have been removed since the last garbage collection run. This model seems easy to reason about, if it suits you.

A nice side effect of using labels is you can list all objects under the jurisdiction of a single applier.

We've forked kube-applier for now, but I did really like aspects of flux too!

Until this is supported, I will just create a script in our CI system which parses the git history to see which manifests have been removed, then warns the user (either in CI or Slack). Additionally, it could suggest a command to remove those resources, but the step would still remain manual.

Of course this won't cover all resources, which is why it would need to be manual.

I have the impression that, because of this, a rename of a deployment only creates a new deployment, but does not remove the old one. The problem with that is that we have certain workloads of which only 0 or 1 instances may be running (e.g. a consumer of a queue that needs to keep order). So we really want the rename to also delete the old deployment, alongside creating the new one.

We were also experimenting with "Jobs" for database migrations, but it turns out:
"You can't mutate certain parts of the job after it is created." So if we cannot delete and create a new job, we are hitting this limitation.

ts=2018-08-31T11:54:42.190903393Z caller=loop.go:197 component=sync-loop err="argus:job/argus-dashboard-migration: running kubectl: The Job \"argus-dashboard-migration\" is invalid: spec.template: Invalid value:  core.PodTemplateSpec{ ... }: field is immutable"

On a more fundamental level, this seems to break the invariant that "Flux keeps the state of the cluster in sync with the content of the git repo" (we now need an "out-of-band", manual touch with kubectl to delete the old deployment). And... that manual out-of-band touch is not audited in Flux? That seems to break another invariant, "Flux audit logs all changes to the cluster"?

My experiment (in the flux namespace, but this was just an experiment):

  • create a deployment and then change the name like this:
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: deployment-alpine
+  name: deployment-alpine-new
   namespace: flux
   labels:
     app: alpine

The result of pushing this change now is:

err=null output= [layout: newlines added by me to replace the escaped ones]
  "namespace \"argus\" configured
  namespace \"flux\" configured
  namespace \"play\" configured
  service \"argus-dashboard\" unchanged
  serviceaccount \"flux\" unchanged
  clusterrole \"flux\" configured
  service \"memcached\" unchanged
  clusterrolebinding \"flux\" configured
  secret \"flux-git-deploy\" unchanged
  configmap \"flux-ssh-config\" unchanged
  ingress \"ingress-dashboard\" unchanged
  deployment \"argus-dashboard\" unchanged
  deployment \"deployment-alpine-new\" created  ## this is the create, but no delete
  deployment \"flux\" unchanged
  deployment \"memcached\" unchanged"

And now we see these two deployments:

 kubectl -n flux get all                                                                                                          
NAME                                        READY     STATUS    RESTARTS   AGE
pod/deployment-alpine-c868f6fbf-b8lwn       1/1       Running   0          29m
pod/deployment-alpine-new-c868f6fbf-r6h6z   1/1       Running   0          28m
pod/flux-6d769cbcb7-5l8b7                   1/1       Running   0          18h
pod/memcached-6bdf479df8-w7tbp              1/1       Running   0          4d

NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
service/memcached   ClusterIP   None         <none>        11211/TCP   7d

NAME                                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/deployment-alpine       1         1         1            1           29m
deployment.apps/deployment-alpine-new   1         1         1            1           28m
deployment.apps/flux                    1         1         1            1           7d
deployment.apps/memcached               1         1         1            1           7d

NAME                                              DESIRED   CURRENT   READY     AGE
replicaset.apps/deployment-alpine-c868f6fbf       1         1         1         29m
replicaset.apps/deployment-alpine-new-c868f6fbf   1         1         1         28m
replicaset.apps/flux-6d769cbcb7                   1         1         1         1d
replicaset.apps/flux-77ff458dc4                   0         0         0         1d
replicaset.apps/flux-856f8b475d                   0         0         0         1d
replicaset.apps/memcached-6bdf479df8              1         1         1         7d

Another approach could be:

  1. annotate the resources with flux.weave.works/delete: "true"
  2. commit changes
  3. Flux checks if the resources marked for deletion are present on the cluster and removes them
  4. Flux deletes the annotated files and commits the changes to git

This approach is nice because you get a record of it happening in git (for both a human asking the thing to be deleted, and the thing actually being deleted).

It might be good to use Kontena's Pharos Cluster as inspiration. They have an addons system which works really well, propagating changes and providing the ability to delete resources.

Most of this logic can be found in k8s-client.

In a nutshell, what it appears to do:

  1. For a set of resources (yaml files), a cumulative checksum is generated
  2. For each resource: it is created if missing, or updated in place if the checksum mismatches
    • these resources get decorated with a resource collection identifier (static) and the cumulative checksum (dynamic)
  3. After changes are applied, all cluster resources containing the resource collection identifier (static) are listed, and pruned/deleted if the checksum mismatches (meaning the file was most likely deleted off disk)

It'd be awesome if this logic could be enabled via a feature flag/CLI option, repo-wide.
Resources that shouldn't be auto-pruned could be decorated with flux.weave.works/delete: "false", where flux.weave.works/delete: "true" is the default if no annotation exists.
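
Sketching how a decorated resource might look under that scheme (the label keys here are made up; Pharos uses its own naming):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical workload
  namespace: default
  labels:
    resource-collection: my-sync-repo   # hypothetical key: static collection identifier
    resource-checksum: "5f3a9c1"        # hypothetical key: cumulative checksum of the current set
  annotations:
    flux.weave.works/delete: "false"    # opt this resource out of auto-pruning

After an apply, anything carrying the collection label but an out-of-date checksum would be pruned, unless annotated with delete: "false".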

It might be good to use Kontena's Pharos Cluster as inspiration. [...]

That is a pretty clean way of going about it.

If apply fails for a resource, that resource will get deleted, whereas you'd probably want it to just stay where it is until a human can fix things. Perhaps just the rule "don't delete anything that failed to apply" would avoid that.

I think it'd be reasonable to apply changes one at a time, grouped by folder and then alphabetically sorted. This would allow users to express order and logical groups by naming files 01-...yml, 02-...yml, 03-...yml, etc.
For minimizing blast radius, each top level folder could be its own stack.
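
For example, a hypothetical repo layout:

cluster/
  01-namespaces.yml
  02-rbac.yml
  03-workloads.yml
monitoring/
  01-namespace.yml
  02-prometheus.yml

Files within each folder would be applied in alphabetical order, and each top-level folder would be applied (and pruned) as its own stack.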

When an error is encountered, I assume the application of changes would be aborted and the prune step skipped, allowing a human operator to intervene.
Maybe reasonable efforts could be made to validate the changes before the apply is started (e.g. validate the yaml format, or run it through kubernetes validation)?

Additionally, maybe it would be best to revert the commit if the application failed. This behavior could be opt-in.
The whole commit could be reverted or only the affected stack(s) if each top level folder is its own stack.

edit: for clarity, "application" refers to applying the yaml files

@stefanprodan I like your suggestion, but rather than delete: true I would be more inclined to use something like ensure: present and ensure: absent. So maybe by default all resources are created with ensure: present, and flux purges resources marked ensure: absent?
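
In manifest form that might look like this (the annotation key is hypothetical, following the existing flux.weave.works naming):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app        # hypothetical workload
  namespace: default
  annotations:
    flux.weave.works/ensure: "absent"   # hypothetical: "present" (the default) or "absent"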

@hiddeco is it possible to enable garbage collection in a "dry-run" or "noop" state, so one can determine whether it will behave as expected, rather than it accidentally deleting resources you didn't mean for it to delete?

I should really close this, since garbage collection was implemented in #1442 and released in v1.11.0.
Thanks everyone for the discussion! @cdenneen I've created an issue for your latter suggestion (#1990).

The FAQ still claims deletion is not possible, and links to this issue. Though from my quick scan, it seems like that is out of date.

https://docs.fluxcd.io/projects/helm-operator/en/latest/faq.html#i-ve-deleted-a-helmrelease-file-from-git-why-is-the-helm-release-still-running-on-my-cluster
Snapshot: http://archive.is/tmd4S

Might want to open that as a new issue, @nickatsegment; since this one is closed it might be overlooked. But I'm here via the FAQ, so...