fluxcd / notification-controller

The GitOps Toolkit event forwarded and notification dispatcher

Home Page:https://fluxcd.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Alerts sent to Alertmanager contain timestamps as labels, preventing alert grouping

tdemin opened this issue · comments

Description

notification-controller sets timestamp label on outgoing alerts, preventing Alertmanager from recognizing subsequent alerts with the same label set as a single alert (Alertmanager considers newly-posted alerts with an existing label set to be the same updated alert). This means Alertmanager will dispatch multiple alerts for what is essentially a single outage (e.g. failing GitRepository with a short reconciliation interval, e.g. 1m, which is the default for flux bootstrap-generated GitRepository)

This also requires us to add alert receivers with send_resolved: false separate from already-configured Prometheus receivers, as Flux alerts would be expiring at the default Alertmanager resolve_timeout of 5m, with Flux posting new ones over time.

Alertmanager (as of v0.27.0) seems to lack the ability to drop incoming alert labels.

Multiple alerts in Alertmanager

Proposal

While group_by combined with a long enough group_interval might make subsequent alerts less annoying, it still means we end up with multiple outgoing Alertmanager alerts in a single message, possibly grouped with the templates.

I propose removing this label altogether. Alertmanager already provides us with .StartsAt / .EndsAt, posting a new alert would just set the new "startsAt". This would technically be a breaking change, as existing users might rely on this label in their alert templates.

I'm willing to work on a PR for this if needed.

System details

Flux v2.2.3, notification-controller v1.2.4, Alertmanager v0.27.0

Minimal sample manifest set for testing:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gitrepository-that-always-fails
spec:
  url: https://this-repo-doesnt-exist.example.com/repo.git
  interval: 1m
  ref:
    branch: master
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: gitrepository-that-always-fails
spec:
  summary: A GitRepository is failing
  eventSeverity: error
  eventSources:
    - kind: GitRepository
      name: gitrepository-that-always-fails
  providerRef:
    name: alertmanager
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: alertmanager
spec:
  type: alertmanager
  secretRef:
    name: alertmanager-config
---
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: alertmanager-config
stringData:
  address: http://alertmanager:9093/api/v2/alerts
---
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
spec:
  type: ClusterIP
  selector:
    alertmanager: alertmanager
  ports:
    - name: http-web
      port: 9093
      protocol: TCP
      targetPort: 9093

I propose removing this label altogether. Alertmanager already provides us with .StartsAt / .EndsAt, posting a new alert would just set the new "startsAt". This would technically be a breaking change, as existing users might rely on this label in their alert templates.

I agree that for Alertmanager removing the timestamp would be better as it's recored in startsAt. We can ship this breaking change in the next minor release Flux v2.3 if you can open a PR.