fluxcd / notification-controller

The GitOps Toolkit event forwarder and notification dispatcher

Home Page: https://fluxcd.io



Alerts duplicate error messages in slack channel

dimakyriakov opened this issue.

Problem:
I created an alert to monitor all HelmReleases in a specific namespace, and it is flooding a Slack channel with error messages.
It re-sends the same errors periodically, even though nothing has changed in the HelmReleases.

Here is the alert file:

apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: integration
  namespace: flux-system
spec:
  summary: "integration"
  providerRef:
    name: slack
  eventSeverity: info
  eventSources:
    - kind: HelmRelease
      namespace: integration
      name: '*'

Questions:
Is it possible to trigger an alert only when we make changes to a HelmRelease, and not when the status of an existing one changes?
Is it possible to avoid receiving the same alert message again after a certain period of time, once we have already received it?

commented

Hi, can you share some example error messages to help understand what type of events from helm-controller are causing this?
notification-controller already has rate limiting that suppresses duplicate events for 5 minutes by default. After 5 minutes, you'll receive an alert again if the same event is received again. I think that's what's happening in this case.
It may be an issue in helm-controller, which is emitting these error events, and that may need attention. A change in helm-controller or in the HelmRelease itself might suppress or fix the errors.
If these errors aren't actionable, you can ignore them in notification-controller Alerts by defining an ExclusionList, see https://fluxcd.io/flux/components/notification/alert/#specification .

Some of our HelmReleases have the error "reconciliation failed: install retries exhausted", and mostly we are OK with it.
It would be nice to get this error only once, when it first appears.
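Following the ExclusionList suggestion above, here is a sketch of the original Alert with an exclusionList entry for this particular message. The exclusionList field holds Go regular expressions matched against the event message, per the Alert specification linked above; the pattern itself is an illustrative assumption. Note that this hides every occurrence of the error, including the first one, so it does not give "only once" behaviour:

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: integration
  namespace: flux-system
spec:
  summary: "integration"
  providerRef:
    name: slack
  eventSeverity: info
  eventSources:
    - kind: HelmRelease
      namespace: integration
      name: '*'
  # Go regular expressions matched against the event message;
  # matching events are dropped instead of being forwarded to Slack.
  # The pattern below is an example for the error quoted above.
  exclusionList:
    - "install retries exhausted"
```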

@dimakyriakov By design, error alerts are re-sent every 5 minutes until the error is resolved. You can increase the interval with the --rate-limit-interval flag; see the flag docs here: https://fluxcd.io/flux/components/notification/options/

Thank you for the response, guys. I will close the ticket.

@stefanprodan, hey, I just want to ask where exactly I can set the --rate-limit-interval option.
I created the Provider and Alert in a YAML file, and to me it looks like a CLI option.

That’s a controller flag; see here for how to change controller flags: https://fluxcd.io/flux/cheatsheets/bootstrap/
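For reference, a minimal sketch of that approach: the kustomization.yaml in the flux-system directory created by flux bootstrap, with a JSON 6902 patch that appends the flag to the notification-controller container args. The 1h value is an arbitrary example, not a recommended setting; choose an interval that matches how often you want to be reminded of unresolved errors:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Append the flag to the args of the first container
  # in the notification-controller Deployment.
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --rate-limit-interval=1h  # example value only
    target:
      kind: Deployment
      name: notification-controller
```

Once this is committed, the flux-system Kustomization applies the patched Deployment on its next reconciliation.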

This is my kustomization.yaml file. Do you mean I have to increase --rate-limit-interval in the patch targeting notification-controller?

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- gotk-components.yaml
- gotk-sync.yaml
patchesStrategicMerge:  # these are tuned for demonstration and debugging
- |-
  apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
  kind: Kustomization
  metadata:
    name: flux-system
    namespace: flux-system
  spec:
    patches:
    - target:
        version: v1
        group: apps
        kind: Deployment
        name: notification-controller
        namespace: flux-system
      patch: |-
        - op: add
          path: /spec/template/spec/containers/0/args/-
          value: --rate-limit-interval=10s  # do not discard messages that are sent again after 10s+
    - target:
        version: v1
        group: apps
        kind: Deployment
        name: kustomize-controller
        namespace: flux-system
      patch: |-
        - op: add
          path: /spec/template/spec/containers/0/args/0
          value: --concurrent=5             # increase the number of Kustomizations processed at once
        - op: replace
          path: /spec/template/spec/containers/0/resources/limits/cpu
          value: "2"                        # allow KC access to more CPU
        - op: replace
          path: /spec/template/spec/containers/0/resources/limits/memory
          value: "2Gi"                      # allow KC access to more memory
    - target:
        version: v1
        group: apps
        kind: Deployment
        name: source-controller
        namespace: flux-system
      patch: |-
        - op: replace
          path: /spec/template/spec/containers/0/resources/limits/cpu
          value: "2"                        # allow SC access to more CPU
        - op: replace
          path: /spec/template/spec/containers/0/resources/limits/memory
          value: "2Gi"                      # allow SC access to more memory
    - target:
        version: v1
        group: apps
        kind: Deployment
        name: helm-controller
        namespace: flux-system
      patch: |-
        - op: add
          path: /spec/template/spec/containers/0/args/0
          value: --concurrent=12             # increase the number of HelmReleases processed at once
        - op: replace
          path: /spec/template/spec/containers/0/resources/limits/cpu
          value: "2"                        # allow HC access to more CPU
        - op: replace
          path: /spec/template/spec/containers/0/resources/limits/memory
          value: "2Gi"                      # allow HC access to more memory

--rate-limit-interval=10s  # do not discard messages that are sent again after 10s+

No wonder you get alert spam: the default is 5m, and this sets it to 10s. You can increase it to a value that fits your needs.
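Applied to the kustomization.yaml above, only the notification-controller entry under spec.patches needs to change. A sketch with 30m as an arbitrary example value; any value longer than the 5m default suppresses repeats of the same event for longer than the default does, whereas 10s suppresses them for much less:

```yaml
- target:
    version: v1
    group: apps
    kind: Deployment
    name: notification-controller
    namespace: flux-system
  patch: |-
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --rate-limit-interval=30m  # example: suppress repeats of the same event for 30 minutes
```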