prometheus / alertmanager

Prometheus Alertmanager

Home Page: https://prometheus.io

Feature request: provide a way to acknowledge a firing alert

prymitive opened this issue

When a whole team tries to take action on multiple alerts during an incident, it requires a lot of communication effort to coordinate who is dealing with which alert.
Some systems (like PD) provide a way to acknowledge an alert and assign it to a specific person (usually the on-call person), but that requires routing every alert through such a system. Also, during an incident people often volunteer to help handle some of the alerts, so the usual routing of alerts might not cover that.
It would be very useful to have some way to mark an alert as "I'm working on that". This was discussed on the mailing list, and one of the proposed solutions was to support auto-expiring silences (once the alert is resolved, the silence is automatically expired regardless of its endsAt value), which was previously suggested in #1057 but not accepted.

Human responses to alerts are out of scope for the alertmanager, this is better handled by a system such as PagerDuty. The Alertmanager is just about delivering Prometheus-generated notifications.

Is there any technical limitation that prevents auto-expiring silences from being implemented?
I think that auto-expiring silences are useful as a standalone feature and would be a good enough solution to the acknowledgement problem. Are those also out of scope, or just the acknowledgement?

Auto-expiring silences are not wise and would be challenging to implement, see #1057.

See also https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

Also, this would make creating silences in advance impossible as they'd be auto-deleted.

Cross posting from the above linked email thread:

The only problem with using silences as a replacement is that if you already have plenty of silences (for broken hardware or other issues that take time to resolve), it becomes tricky to find all the alerts that got acked but still require action.
Silences that expire when an alert is resolved sound very useful.

This is turning alertmanager into an incident response platform, when its purpose is to group, deduplicate, and send the notification to the user's incident response provider of choice (pagerduty, ops genie, webhook, etc). To me, acknowledging that an alert has been received and it is currently being addressed makes the most sense at the end of this incident chain (prometheus->alertmanager->provider), rather than having two places to do it that could be out of sync.

(from the email chain)

Sometimes there's a flood of small issues and it's hard to tell who's fixing what just by looking at alerts.

If they are small issues that have been deemed not worthy of paging (i.e. being routed to pagerduty), a user creating the silence and writing in the comment metadata that they're working on it, and then deleting it when finished, seems appropriate. Just because an alert has stopped firing (and in this scenario, expires its silence), doesn't mean that the situation has resolved. Auto-expiring silences could lead to duplicate work more easily than the engineer responsible creating and manually expiring the silence when the alert has been resolved.

It would be very useful to have some way to mark an alert as "I'm working on that". This was discussed on the mailing list, and one of the proposed solutions was to support auto-expiring silences (once the alert is resolved, the silence is automatically expired regardless of its endsAt value), which was previously suggested in #1057 but not accepted.

I think we already support that with silences, and as stated above, I think making them auto-expire would end up being problematic.

I agree that "acknowledging" is not the right thing to do. On the other hand, I do think that automatically-expiring silences have their place and are useful.

this would make creating silences in advance impossible

this can easily be solved.

Just because an alert has stopped firing, doesn't mean that the situation has resolved

That is true, but there are many situations where, when a specific alert has stopped firing, it means the situation is resolved. I would not make this behavior the default.

manually expiring the silence when the alert has been resolved

I don't want to do things manually that a computer can do for me. Sometimes, I'm not even awake when the situation resolves – say, a job that is failing because a dependency produced garbage data. I'm re-running the dependency, and I expect that once it finishes, the failing job will recover. If that happens, and later it fails again, I want to know immediately, because something else has happened. I would also want some threshold when I get notified anyway, even if it never recovered, but I do not want to stay up until 4am to manually expire a silence.

In #1057 @brian-brazil you said

[this] wouldn't work with AM clustering

Could you please elaborate on how exactly this would not work, where manually expiring silences does?

I would also want some threshold when I get notified anyway, even if it never recovered, but I do not want to stay up until 4am to manually expire a silence.

That's a matter of setting an appropriate expiry time on the silence.

Could you please elaborate on how exactly this would not work, where manually expiring silences does?

I can't remember offhand, but it was probably something to do with a network partition. What happens if one side deletes a silence and the other doesn't?

What would happen if I pressed "expire now" on one side of the partition?

a matter of setting an appropriate expiry time

I don't always know, by a factor of 2-4, how long this will take. I don't want to have to do a whole lot of math either, when I could instead say "whenever it's done is the right time, let me know if it's still an issue tomorrow morning".

Also, this would make creating silences in advance impossible as they'd be auto-deleted.

That's only if all silences auto-expire, rather than only those with a flag like autoExpire: true.
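
For illustration only, here is a minimal sketch of what such an opt-in flag could look like on a silence created through the v2 API. The autoExpire field is hypothetical and does not exist in Alertmanager today; the other fields mirror the standard silence payload, and the alert name and addresses are made up.

```go
// Hypothetical silence payload: everything except AutoExpire matches the
// fields Alertmanager's v2 silences endpoint already accepts.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers   []matcher `json:"matchers"`
	StartsAt   time.Time `json:"startsAt"`
	EndsAt     time.Time `json:"endsAt"`
	CreatedBy  string    `json:"createdBy"`
	Comment    string    `json:"comment"`
	AutoExpire bool      `json:"autoExpire"` // hypothetical: expire as soon as the matched alerts resolve
}

func main() {
	s := silence{
		Matchers:   []matcher{{Name: "alertname", Value: "DiskWillFillIn4Hours"}},
		StartsAt:   time.Now(),
		EndsAt:     time.Now().Add(24 * time.Hour), // upper bound; auto-expiry would cut it short
		CreatedBy:  "oncall@example.com",
		Comment:    "working on this, expire when the alert resolves",
		AutoExpire: true,
	}
	body, err := json.Marshal(s)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:9093/api/v2/silences", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
}
```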

Would it be possible to have the ability to set extra annotations from Alertmanager itself that would be added to the firing alert? That way someone could add a note that persists only as long as the alert keeps firing.

a matter of setting an appropriate expiry time

We do this right now, and it sorta works. You silence something for a few hours and that's typically enough. But once in a while you miss an issue reappearing after you think you fixed it, or you set too long an expiry time and forget to unsilence.

let me know if it's still an issue tomorrow morning".

I'd personally set it to tomorrow morning if it could wait, rather than risking waking myself up again.

The key word being "personally". I think this feature would not prevent you from following your style, but it would allow others to use a different one that maybe works better for that specific circumstance.

As @prymitive said, this can lead to missing new events.

Pre-creating silences would work as it does now if the auto-expiry only triggers on the N>0 -> N=0 transition of an alert group, or remembers that it has silenced at least one alert in the past.
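
As a rough sketch of that rule (purely illustrative; none of these names exist in Alertmanager), the expiry check would need to carry one extra bit of state per silence:

```go
// Sketch of the proposed rule: an auto-expiring silence is only expired once
// it has matched at least one firing alert and the match count then drops to
// zero, so a silence created in advance (which has never matched anything)
// is left alone. Illustrative only; not Alertmanager code.
package transition

type silenceState struct {
	autoExpire       bool // the hypothetical opt-in flag
	hasMatchedBefore bool // the "remembers that it has silenced at least one alert" bit
}

func shouldExpire(st *silenceState, matchingFiringAlerts int) bool {
	if !st.autoExpire {
		return false
	}
	if matchingFiringAlerts > 0 {
		st.hasMatchedBefore = true // we are in the N>0 phase
		return false
	}
	// N has dropped to 0: expire only if the silence was ever in use.
	return st.hasMatchedBefore
}
```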

That's additional state to manage. What happens if a Prometheus restarts, and takes long enough that the alerts resolve?

If that's the case, then you need to adjust the resolve_timeout anyway.

That's not relevant here, as Prometheus is what chooses the end time, which is a few evaluation intervals, so we could easily be talking about less than a minute. Alerts flapping is normal, and we should be robust to it.

Alerts flapping is normal

Doesn't happen frequently in my experience; as in, I've never seen alerts that are flapping because of some miscommunication between Prometheus and Alertmanager. It's only a problem when Prometheus is down due to a bad config restart. And if that's the case, then is that really a blocker for this, as it sounds like an unrelated problem?

I've never seen alerts that are flapping because of some miscommunication between Prometheus and Alertmanager

User bug reports indicate otherwise.

And if that's the case, then is that really a blocker for this, as it sounds like an unrelated problem

It'd affect the reliability of any such solution, as when things would get unsilenced is not predictable.

If there are users who hit flapping issues, then they already have a problem; reliability doesn't get any worse than it already is for them. So should that really be a blocker? There's always a corner case for everything.

I wrote a tiny daemon that keeps extending silences as long as there are alerts matching them.
This gives me pretty much what I want from acknowledgements.

When an alert fires I'll silence it for 10 minutes with the comment ACK! working on this. The daemon then keeps checking all silences whose comment starts with ACK!: if they still match alerts and would expire soon, it extends them by 15 minutes; if they no longer match any alerts, it lets them expire.

https://github.com/prymitive/kthxbye
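
For reference, the loop described above boils down to roughly the following. This is a simplified sketch rather than the actual kthxbye code; the client interface stands in for an assumed thin wrapper around Alertmanager's /api/v2/silences and /api/v2/alerts endpoints.

```go
// Simplified sketch of the kthxbye idea: keep extending "ACK!" silences that
// still match firing alerts, and let the rest run out on their own.
package ackloop

import (
	"strings"
	"time"
)

// silence is the minimal view of a silence this loop needs.
type silence struct {
	ID      string
	Comment string
	Status  string // "active", "pending" or "expired"
	EndsAt  time.Time
}

// client is an assumed wrapper around the Alertmanager v2 HTTP API;
// these method names are illustrative, not a real client library.
type client interface {
	ListSilences() ([]silence, error)
	MatchesFiringAlert(s silence) (bool, error) // does any currently firing alert match the silence's matchers?
	ExtendSilence(id string, by time.Duration) error
}

const (
	ackPrefix = "ACK!"           // only silences acknowledged this way are managed
	extendBy  = 15 * time.Minute // how far endsAt is pushed out each time
	soon      = 5 * time.Minute  // "would expire soon" threshold
)

// runOnce is a single pass of the loop; a real daemon would call it on a ticker.
func runOnce(am client) error {
	silences, err := am.ListSilences()
	if err != nil {
		return err
	}
	for _, s := range silences {
		if s.Status != "active" || !strings.HasPrefix(s.Comment, ackPrefix) {
			continue
		}
		if time.Until(s.EndsAt) >= soon {
			continue // not close to expiring yet
		}
		matching, err := am.MatchesFiringAlert(s)
		if err != nil {
			return err
		}
		if matching {
			// The alert is still firing, so keep the "ack" alive.
			if err := am.ExtendSilence(s.ID, extendBy); err != nil {
				return err
			}
		}
		// If nothing matches any more, do nothing and let the silence
		// expire at its own endsAt.
	}
	return nil
}
```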

How do you deal with clustering? Does one Alertmanager get to decide to resolve the silence for all the Alertmanagers? In such a way that silences would go away if one Alertmanager is partitioned from one Prometheus server but not the other AM?

I don't deal with clustering at all.
All I need is an Alertmanager API URL; whether that's a single instance or a cluster doesn't really matter.
If it's a cluster and it's in a split-brain state, then you'll have all the problems of a split brain across your entire stack, for every component that uses Alertmanager; kthxbye isn't in any way special here.

That was a question about how alertmanager could implement this, not a question about kthxbye :)

The issue with not dealing with it is how to debug why a notification is un-silenced in a clustered setup.

My bad, thought you were responding to my comment.

I have a proposal regarding this topic:

  • Add an "Infinite Duration" Switch to the "New Silence" interface
  • Add an "Expire on resolve" option to silences

The first one allows for quicker creation of unlimited silences, which would not expire by time but only through manual or automatic actions.

The second one lets a silence automatically expire when at least one of the silenced alerts is resolved.
This could provide the "working on it" silence functionality, while also making sure that a recurring alert is caught, because the silence expired when the alert resolved the first time.
Obviously this does not work for flapping alerts, but flapping instances/values could be handled in the alert rule definition so that the alert itself does not flap.

What are the thoughts about this?

Add an "Expire on resolve" option to silences

Very much like the idea of this option (although it wouldn't be perfect for all cases). We tend to use long-term silences and periodically delete them to do this.