prometheus / alertmanager

Prometheus Alertmanager

Home Page:https://prometheus.io

Time of day based alert routing/notification

brian-brazil opened this issue

We've had numerous requests for routing alerts based on the time of day/week. This issue is to track those.

So, @brian-brazil (@fabxc ?), could you provide any design guidance on how to implement said feature, as I think I'd like to suppress all 'severity: warning' alerts overnight rather than putting loads of time-based rule duplication into my Prometheus alert rules. I really don't want to do that as warnings are still valid & worth warning about if I go looking for current Alerts active - I just don't want to be woken up for them. They also shouldn't seem to resolve every evening & start again in the morning.

After being burnt wasting effort on #709 I want some suggestion up front of what might be accepted from the maintainers.

Indeed, it would be nice to see whether it is within the scope of AM. It'd be great to have this feature.

I would like to be able to have time of day, or day of week influence which receiver an alert is sent to. i.e. - during daytime/business hours, alerts might go to a slack channel, vs during night/weekends, same alerts might go to pagerduty, or on e-mail for the current on-call person.

From the Alertmanager perspective, it could be nice to use existing label matching to control routing to different receivers based on datetime.

i.e. -

- match_re:
    alertname: MyAlert
    day: '(Monday|Tuesday|Wednesday|Thursday|Friday)'
  receiver: slack_appteam
- match_re:
    alertname: MyAlert
    day: '(Saturday|Sunday)'
  receiver: pagerduty_appteam

Something similar for time of day? It's a little trickier, and in routing it would be nice to keep things simple, i.e. match_re on a label like 'time_window: business_hours'. But I'm not sure how to get that meaningful label in there from the Alertmanager perspective without some sort of relabeling within Alertmanager itself, and Prometheus passing along an alert date/time. I'm a Prometheus/Alertmanager newbie, so apologies in advance if I'm missing something obvious here.

An approach that generates meaningful date/time labels on the alerts means those labels could also be used in inhibition, as per @tyrken's request to inhibit warnings for some or all alerts overnight.

One of my main drivers for this is to not introduce time based rule duplication in all my prometheus alerts, as that feels cumbersome.

The ability to subdue events based on the day and/or time would be really useful. I don't know what the best approach is here, but for Sensu they follow a re-usable pattern tied to the handler (receiver): subdue-attributes

+1 : Just here to say it would be a lovely feature to be able to sleep well during the night and have some alerts only during business hours / days

Adding https://golang.org/pkg/time/#Weekday for future reference. This could be implemented as a pipeline step that filters based on a defined day/time range. All times would be done in UTC.

This is an important missing feature! Either enable alerts during some time ranges, or allow recurring silence rules.

Is there any workaround to silence staging/test/QA alerts during the night, but still receive them during the day?

@danielmotaleite I have a solution for this based on inhibition rules that doesn't require any change to AlertManager. I'll post the blog post address here once it is out.

AlertManager definitely needs a way for setting silence hours in config file. With label targeting, like it is in inhibit_rules.

@simonpasquier - waiting for that blog link!

@simonpasquier I am thirsty for this. 👍

Hello
Here is a starting point that we have used to silence alerts outside of office hours:

vector(1)
and on()
(
        6 < hour(vector(time()))
    and hour(vector(time())) < 19
    and 0 < day_of_week(vector(time()))
    and day_of_week(vector(time())) < 6
)

Of course it's GMT based, so it does not take into account summer/winter time, nor bank holidays.

Hope it might help others.

You should just have to replace vector(1) with the prometheus expression/aggregation you need
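To make the window boundaries of that expression explicit, here is a rough Python mirror of it. This is only a sketch: the function name office_hours_utc is made up for illustration, and it assumes the usual PromQL convention that day_of_week() returns 0 for Sunday through 6 for Saturday, all in UTC.

```python
from datetime import datetime, timezone


def office_hours_utc(now):
    """Mirror of: 6 < hour < 19 and 0 < day_of_week < 6, evaluated in UTC.

    `now` must be a timezone-aware datetime. The function name is
    hypothetical; it is not part of Prometheus or Alertmanager.
    """
    now = now.astimezone(timezone.utc)
    # PromQL day_of_week(): 0 = Sunday ... 6 = Saturday.
    # Python isoweekday(): 1 = Monday ... 7 = Sunday, so take it modulo 7.
    day_of_week = now.isoweekday() % 7
    return 6 < now.hour < 19 and 0 < day_of_week < 6


if __name__ == "__main__":
    # Wednesday 2020-06-03, 07:00 UTC is the first minute inside the window.
    print(office_hours_utc(datetime(2020, 6, 3, 7, 0, tzinfo=timezone.utc)))
```

Because the comparisons are strict, the window is 07:00 to 18:59 UTC, Monday to Friday.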

https://gist.github.com/roidelapluie/8c67e9c8fb18b310a4a90cb92a23056b

Our solution, with GMT and days off.

Then you do:

vector(1) and on() business_hour

That takes holidays in consideration.

PS: about daily_saving_time_belgium: yes it works.

I've written a blog post on how I solved my use case - link

@Tom-Fawcett this is so great!

@roidelapluie
One gotcha of the pure recording-rule-based approach appears to be that a currently firing alert stops firing when the time-of-day boundary is crossed. It is then marked as resolved and potentially triggers a resolved notification, which can be confusing both to the responders and when looking back at the history in the TSDB.

Have you encountered this? If there's no good workaround, I plan to try the approach @Tom-Fawcett wrote up. It seems like it would avoid that particular issue.

Yes we have switched to inhibition now!! so much easier!! :)
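For readers wondering why inhibition avoids the "resolved at the boundary" problem, here is a very simplified Python model of it. It is not Alertmanager's implementation: the function notifiable, the QuietHours alert name and the rule format are assumptions for illustration, and the real inhibit_rules also support an `equal` list that is ignored here.

```python
def notifiable(alerts, inhibit_rules):
    """Return the alerts that would still notify under simple inhibit rules.

    Each rule is a (source_labels, target_labels) pair: while some other
    firing alert carries all source_labels, alerts carrying all
    target_labels are muted. Muted alerts are NOT removed from `alerts`;
    they keep firing and only their notification is suppressed.
    """
    def matches(alert, labels):
        return all(alert.get(key) == value for key, value in labels.items())

    result = []
    for alert in alerts:
        muted = any(
            matches(alert, target)
            and any(matches(other, source) for other in alerts if other is not alert)
            for source, target in inhibit_rules
        )
        if not muted:
            result.append(alert)
    return result


if __name__ == "__main__":
    # "QuietHours" stands for a hypothetical alert that fires outside office hours.
    firing = [
        {"alertname": "QuietHours"},
        {"alertname": "DiskFull", "severity": "warning"},
        {"alertname": "DbDown", "severity": "critical"},
    ]
    rules = [({"alertname": "QuietHours"}, {"severity": "warning"})]
    print([a["alertname"] for a in notifiable(firing, rules)])
```

In this model the warning stays in the list of firing alerts the whole time, so nothing resolves when the time boundary is crossed. The QuietHours alert itself would still notify, so in practice it would presumably be routed to a receiver that does nothing.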

I was considering the development of a calendar exporter. It would produce simple on/off status based on calendar rules.

It would be easier to handle specific cases (multiple time zone, reception rules, non-gregorian calendar) and any number of integrations could be considered.

IMHO it would be an elegant solution but at the cost of database space for dummy metrics.
What do you think ?

The main problem is that with such a thing, if it is down, Prometheus will fire many alerts. Maybe we could write a binary/script that generates files suitable for alerting rules, because that would be more reliable.

Good point. I guess the same code able to generate metrics would be able to generate such a recording rule (in simple cases). Or, it could send the corresponding inhibition requests.

In my line of work (exchange market access for financial institutions), we have a lot of checks related to calendar, across multiple timezones. So it wouldn't be limited to alert inhibition, we also expect events to occur within a specific time frame.

@roidelapluie @michael-doubez just like any other exporter, you should have redundant instances running in different zones, so if one fails, you still get data from the other.

The idea of an exporter outputting date- and time-based rules is actually not bad, but developing one with enough features may be tricky! :)

I am thirsty for this too. 👍

@danielmotaleite @roidelapluie @michael-doubez
Not sure if you still need this or not, but we had the same idea of writing an exporter to expose the current time. We actually put it up on our organization's GitHub page.

https://github.com/OneMainF/time_range_exporter

Feel free to do with it what you will. It has support for adding custom durations and even exposes whether the current day is a national holiday. It's java based so take that how you will. I know different people feel differently about java.

Personally I feel that this is a nasty way of doing it though. I would prefer for this logic to live on the alert routing side than the actual alert itself.

The lack of this feature and the complex proposed workarounds make this a frustrating situation for people trying to migrate to Prometheus. In Nagios, I just define timeperiods in a very straightforward way and assign alerts to them.

define timeperiod {
  timeperiod_name 24x7
  alias           24 Hours A Day, 7 Days A Week
  sunday          00:00-24:00
  monday          00:00-24:00
  tuesday         00:00-24:00
  wednesday       00:00-24:00
  thursday        00:00-24:00
  friday          00:00-24:00
  saturday        00:00-24:00
}

I'm using a Python exporter to solve this problem. It is very similar to @cpmoore's solution, but it is Python inside a Docker container instead of a Java application.
Project here: https://github.com/allangood/holiday_exporter

I'm also interested about this feature in AM.

I'm also interested in this.

Can we add this as a priority for the next release? We are all waiting for a first version of this feature. 🤔

I would like to see and discuss a design document first, as it is not as trivial as it sounds.

This is an open source project, there is no central priority list - only what individual people wish to work on.

Adding metoos has no impact on how quickly this is implemented (if it even ever will be implemented), they only clutter the issue. Proposing a design would however help, though keep in mind anything will only work in UTC and other solutions have become apparent above.

Is this really the best place to put in this feature? Most users get the notification on the phone where they can customize their own do not disturb times.

I'm now using a simplified solution based on the code that @roidelapluie shared, but I think a cleaner implementation wouldn't hurt.

@laktak I think it's the right place, yes. Not managing this in a centralized manner means the burden is on the users to create their own solutions. Right now I see people that ignore all alerts, others that don't ignore any alerts and both situations cause excessive stress.

I think @brian-brazil is right, this is an open source project and we'd be better discussing a proposed design instead. I only have experience with this feature in Nagios and I'm not sure we want to replicate that.

I have to agree with @brian-brazil and @gtirloni. IMHO, Alertmanager already has all the tools necessary for this type of configuration. It doesn't offer an out-of-the-box solution, but I don't think it is too hard to implement with the tools we have now. I've just created a Docker-based exporter to help with this, and all I have to do is create an inhibition rule, which is not hard to do. I even added a way to create custom holidays. For me it is much more flexible to do this outside Alertmanager, because I can implement whatever is necessary without changing the code of a big project. Another advantage of doing this in an external exporter is the flexibility to work with local time and/or daylight saving time. On my project page I wrote an example of how to inhibit alerts outside working hours while taking daylight saving into account. A painless implementation, without changing Alertmanager code.

I think the "me toos" are just as unhelpful as the "we don't need this, you can just use this workaround." This issue is proposing an out of the box solution. I think the value is pretty obvious.

I don't think a healthy discussion with examples of how to do things is "unhelpful", but we can have different points of view. And in my opinion, the value is exactly the opposite of "pretty obvious".

If this feature was implemented inside Alertmanager, as @brian-brazil mentioned, it would be implemented using UTC time. In this case, the feature wouldn't work as expected without an external "daylight_savings_exporter". And before someone says that it is possible to use some expressions to solve this problem, some countries don't have fixed daylight saving dates and they are different every year. Look at prometheus/prometheus#4160. If an external workaround would be necessary anyway, what is the point of this feature?

@allangood Offering a different point of view is one thing; coming into an issue that people clearly want and countering everything with a reason it shouldn't exist is quite another. You're not just offering a helpful workaround, you're actively lobbying for this feature not to be built in. There are reasons why a built-in solution would be more desirable (e.g. maintenance cost, higher barrier to entry for new users, etc.). It's fine to think external exporters are better for your particular use cases and preferences, but not everyone shares your use cases or preferences. Coming into an issue for tracking a built-in solution and trying to tell everyone they just need to share your perspective is not helpful.

Prometheus wouldn't be the first piece of software that needed to be timezone aware. You're talking like there aren't ways to convert local timestamps to UTC on the server side. Just because there are some cases where a solution isn't foolproof doesn't mean it couldn't work well for the many, many other users where daylight savings is fixed. This is extremely black and white thinking.

Here's a TZ/DST-aware pattern that we are using as a cron job:

#!/bin/sh
# Schedule a silence for the weekly maintenance window (noon to 2 pm Pacific
# time, next Thursday), expressed in UTC. Requires GNU date for the -d option.

tool=/opt/alertmanager-0.15.2.linux-amd64/amtool
url=http://localhost:9093
user='crond@prod-monitor-01'
fmt='+%Y-%m-%dT%H:%M:%SZ'

# Which zone abbreviation (PST or PDT) applies at noon next Thursday?
tz=$(TZ='America/Los_Angeles' date -d '12:00:00 next thursday' '+%Z')

# Noon-2pm Pacific expressed in UTC for each case.
if [ "$tz" = PST ]; then
    start_time=20:00:00
    end_time=22:00:00
elif [ "$tz" = PDT ]; then
    start_time=19:00:00
    end_time=21:00:00
else
    exit 1  # "impossible"
fi

start="$(date -u -d "$start_time next thursday" "$fmt")"
end="$(date -u -d "$end_time next thursday" "$fmt")"

"$tool" --alertmanager.url="$url" silence add -a "$user" --start="$start" --end="$end" \
    -c 'weekly maintenance Noon-2 Pacific' 'alertname=Our_Special_Service_Unresponsive'

exit $?

For what it's worth.

What about the possibility of allowing re-routing of alerts from the API? Silencing with amtool is easy to manage for time-zones, DST, and holidays using a general-purpose programming language and cron jobs. Adding re-routing of alerts via amtool would allow the same flexibility for those who want to re-route, not just silence.

To me, the most natural approach to solving this appears to be basically what @mpsmith has suggested, which would allow time-based matching inside the routing tree. Similar to match and match_re, if a node doesn't match based on its time rules we simply move on to sibling nodes or stop traversing.

This could be restricted to UTC, or we could leverage Go's time/location capabilities and translate the time ranges to UTC from a user-specified IANA timezone upon evaluation. I think this would provide the most user benefit, however I know timezones are a controversial subject in Prometheus/AM and I'm sure there's wider impacts I haven't considered.

Something like this:

    routes:
    - match:
        severity: critical
      match_times:
        - time_range: '0900-1700'
          days: ['monday', 'tuesday', 'wednesday', 'thursday', 'friday']
          timezone: America/New_York
      receiver: team-X-pager

What are people's thoughts on this approach? I'd be keen on trying to implement it provided the direction is agreed upon.
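As a rough illustration of evaluating a local time range in a user-specified IANA timezone, here is a Python sketch (Python rather than Go, purely for brevity). The function in_local_window and its parameters are hypothetical, and it relies on the zoneinfo module finding a timezone database on the host.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def in_local_window(now_utc, tz_name, weekdays, start, end):
    """True if `now_utc` falls within [start, end) on one of `weekdays`,
    both interpreted in the IANA timezone `tz_name`.

    weekdays uses Python's convention (Monday = 0 ... Sunday = 6).
    This helper is illustrative only, not an Alertmanager API.
    """
    local = now_utc.astimezone(ZoneInfo(tz_name))
    return local.weekday() in weekdays and start <= local.time() < end


if __name__ == "__main__":
    weekdays = {0, 1, 2, 3, 4}
    # 09:30 in New York is 14:30 UTC in January but 13:30 UTC in July.
    winter = datetime(2020, 1, 15, 14, 30, tzinfo=timezone.utc)
    summer = datetime(2020, 7, 15, 13, 30, tzinfo=timezone.utc)
    for moment in (winter, summer):
        print(in_local_window(moment, "America/New_York", weekdays, time(9), time(17)))
```

Converting the current instant to local time at evaluation, rather than precomputing UTC ranges, means DST transitions are handled by the timezone database.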

One thing worth noting with my suggestion above is that it doesn't support the description of arbitrary time periods. Nagios, for example, has a far more descriptive time period construct that allows you to specify things like 'Tuesday' and '2nd Tuesday in December' as valid ranges. I'm not sure what the appetite for this would be though.

@benridley Some thoughts, without being an expert in the latest Alertmanager code base:

I wonder if this should really be a match parameter like you suggest or rather be a route setting that does not affect matching itself. Since you may or may not want the alert to continue matching against siblings if you are out of business hours, and then you'd need an extra setting for that which is separate from the currently already existing continue setting. How about just making it independent of the match behavior, but mute a route when it's out of its time range? Then I think the existing label matching + time-based muting + existing continue setting should be enough to model all use cases? E.g. you could still have two sibling routes that match the same alerts, but are muted at complementary times (one sending to pager, another to Slack), and with a continue: true in the first one, you would always match both, but only one would actually send.

My hunch is that this would also play together better with the alert state tracking (firing/resolved) in the different routes, as alerts wouldn't suddenly switch routes based on time of day, and a route is where that state is kept over time. So as a match-style parameter, a route would suddenly just lose all of its alerts and send resolved-notifications for them. We probably want this to behave more like inhibitions, where state is still tracked in the same routes as before, but only notifications are suppressed.
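A toy Python model of that idea, under stated assumptions: routes keep matching on labels only, each route has a hypothetical "active" predicate, and a muted route still holds the alert but sends nothing. The route layout, receiver names and the business_hours interval are made up for illustration.

```python
from datetime import datetime, timezone


def business_hours(now):
    # Hypothetical interval: Monday-Friday, 09:00-17:00 UTC.
    return now.weekday() < 5 and 9 <= now.hour < 17


ROUTES = [
    # continue=True lets the alert go on to match the next sibling as well.
    {"match": {"severity": "critical"}, "receiver": "slack",
     "active": business_hours, "continue": True},
    {"match": {"severity": "critical"}, "receiver": "pager",
     "active": lambda now: not business_hours(now), "continue": False},
]


def route(alert, now, routes=ROUTES):
    """Return (matched, notified) receiver lists for a single alert."""
    matched, notified = [], []
    for r in routes:
        if all(alert.get(k) == v for k, v in r["match"].items()):
            matched.append(r["receiver"])
            if r["active"](now):  # a muted route keeps the alert but stays quiet
                notified.append(r["receiver"])
            if not r["continue"]:
                break
    return matched, notified


if __name__ == "__main__":
    alert = {"alertname": "DbDown", "severity": "critical"}
    print(route(alert, datetime(2020, 6, 3, 10, 0, tzinfo=timezone.utc)))
    print(route(alert, datetime(2020, 6, 3, 22, 0, tzinfo=timezone.utc)))
```

The alert is matched by both routes at all times, so neither route ever "loses" it; only the set of routes that notify changes with the clock.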

Given that the existing "integrations" of this feature use inhibition, 👍

I also think it'd make sense to have a new top-level section for time range definitions that are reusable from routes (like receiver definitions are). Then in your route you would reference their names, with some constructs to AND or OR multiple conditions together.

Thanks @juliusv, the idea of muting the route makes more sense especially given the life-cycle tracking that happens within a route.

I agree, the reusable interval definitions would be useful. I suppose the next question is what I alluded to above which is how comprehensive do we want an interval definition to be?

Viewing intervals from the perspective of a week and defining intervals on weekdays (similar to my snippet above) seems like it'd be the simplest implementation and cover most use cases. Something more expressive like the Nagios syntax above however would enable some interesting use cases like modelling maintenance windows and public holidays for example. I am wary of the complexity it introduces though.

@benridley Again, I'm more of a drive-by commenter here and haven't deeply thought about it yet, but: since the actual calculation of time ranges would be a completely separate part of the code that doesn't interact in complex ways with other pieces of Alertmanager (other than saying "yes, we're currently in this time range"), it seems fine to me to allow more complex time range definitions (à la Nagios) as well.

I suppose the next question is what I alluded to above which is how comprehensive do we want an interval definition to be?

TL;DR: Don't implement anything new, create a new docs page and ready-to-adapt, community-driven examples instead because everything we need is already there.

#kiss #yagni #minimalism #reuse #docs #community


(Note: Although I sometimes write only routing and/or inhibition, I usually mean both with respect to / focusing on time and different teams receiving alerts based on time.)

Having recently implemented a routing/inhibition solution based on that PromCon 2019 material, I must admit I couldn't be happier with it, given that I can even include public holidays from my country/state and company policies. (For anyone who hasn't seen it, I highly recommend to check it out.)

I could even add SysAdminDay, if I wanted to :-D

With these more involved inhibition rules, I have come to realize that, although it might be nice to have a $simple time of day and weekday based routing, it would not be sufficient for our policies and what we as a group of colleagues would want to express (office hours, on-call duty, being able to sleep, no alerts on holidays, etc.). I can only assume that (almost) everyone who is interested in time of day, day of week, etc. based inhibition/routing would agree that an overly simplistic, albeit more easily implementable solution would not be sufficient.

Given that a $not_so_simple routing/inhibition including stuff like country/state public holidays (or even company policies like holiday between Xmas and new years) is a bit too much to ask to implement (in an ad-hoc routing logic; and a never-ending pit of new feature requests), I feel myself being largely in favor of providing more extensive documentation and examples for inhibition rules rather than asking for an implementation of a full-fledged native routing configuration.


My proposal/RFC to not implement time of day based routing would be:

  1. Create a new examples directory structure and encourage contributors to throw in their (easily adaptable) configurations in a timezone-like fashion, e.g. I would contribute mine as examples/business-hours/Europe/Germany/Saxony/time.rules.

  2. Create a dedicated Routing and Inhibition https://prometheus.io/docs/alerting/routing-and-inhibition/ docs page:

    • highlight the difference between $simple and $not_so_simple inhibition/routing (and that in the end you/your team/your boss would most likely end up wanting $not_so_simple anyway)
    • showcase how both routing and inhibition can be based on labels like time_window
    • link to the growing example configuration directory where users can easily grab/adapt/contribute their local timezone/holiday configs

What do you think?

@wookietreiber That's a valid way of seeing it, though I would still prefer to have first-class support for this. It would be nice not to have to configure anything on the Prometheus side at all, as conceptually this is a routing decision that should be configured only on the Alertmanager side (or even in the receiver tool afterwards, like a PagerDuty schedule). And a dedicated way of configuring this would be much more explicit and clear to the user, rather than having to explain how it's somehow accidentally possible to use alerting rules to pause other alert notifications based on the current time.

But I like the point that making the dedicated configuration too simplistic would be sad, because then a lot of people would still be using the alerting-rule-based variant and we'd have both ways around forever. If we go for something closer to what Nagios has (see e.g. https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/oncallrotation.html), then it looks like we should be able to cover most use cases?

@juliusv I do understand your reservations about both the accidental possibility issue and mixing of concerns between Prometheus and Alertmanager. I'd rather avoid these as well, however, given that Prometheus and Alertmanager are pretty much closely coupled anyway, it doesn't bother me too much.

If we go for something closer to what Nagios has

That's exactly my point: The Nagios configuration is ad-hoc and static. It doesn't allow for any kind of algorithm that defines e.g. when Easter is and you would always have (to remember) to modify it (although only once a year in case of these few varying public holidays). Personally, I'd rather define these kind of rules once and live happily ever after knowing I will never have to change them.

I am confident that the inhibition rules cover all necessary use cases, without the need to implement anything new. Feel free to prove me wrong :-D

It's hard to tell, how many use cases a configuration similar to that of Nagios covers. This largely depends on how fast you could implement how much of that functionality. Given that this will take some time, ATM the inhibition rules are the only way to do (all/most of) it and may possibly find the way into many Prometheus/Alertmanager users configurations. The question then becomes: can you express your current inhibition rules with the new native time-based routing configuration? Would you bother changing your configuration especially if it's already working with inhibition rules?

@wookietreiber Dynamic holidays etc. that you can calculate with PromQL, but that would be hard to express in a Nagios-style notation are a good point. OTOH in practice I would probably just hardcode Easter for the next 10 years and be done with it :) Also, there's probably holidays or other special cases that you would need to hardcode ad-hoc anyway, as they are not calculatable. Overall I see your point, but still leaning my direction :)

@wookietreiber This is how Nagios solves the problem: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/oncallrotation.html

But if you really want to stay with PromQL-based rules, how about improving PromQL's date functionality?
For example, with something like https://www.php.net/manual/en/datetime.formats.php
Here is the library used for that: https://github.com/derickr/timelib
This would allow rules like:
now(Berlin) > strftime(09:00) > strftime(17:00)
or
today(Berlin) == strtime(easter + 1day) for Easter Monday

Please note in my example:

  • now() and today() accept a location for time zone detection, because UTC alone is not enough for alert suppression. Giving just a time zone offset here would not help either, because of summer/winter time.

I don't want to implement complex things like relative dates or time zone handling on my own via PromQL rules.

A question for the maintainers, how do we determine whether this feature should be implemented or not? It's clear that inhibition rules are working for people, but as we've seen some of us are in the camp that believe there's value in supporting this as a first-class feature. Seeing as this is now the second most commented issue (and most thumbed-up) after almost three years I think we should at least have a direction for this issue.

@GreenRover That's exactly what the easy to use/adapt example templates are for.

I adapted the one from here to Germany/Saxony and I basically only had to change the public holiday dates because incidentally the time zones matched. I now provide mine here and then it's more easily adaptable for other German states. When others share their templates for their time zones / countries / states it becomes even easier to use / adapt. #communityFTW

This way, you don't have to do the math or even think about it, I didn't even have to do it.

A question for the maintainers, how do we determine whether this feature should be implemented or not?

@simonpasquier as the current Alertmanager maintainer, any opinions on how to progress?

While the Prometheus expert and nerd in me likes the idea of solving everything through an existing flexible mechanism (PromQL+alerting rules+inhibits), we have to consider how this contraption seems to new and non-expert users. See e.g. #876 (comment) - I think a lot of them will be turned off even with more documentation and examples.

So I'm leaning towards thinking we should do it, with the exact time range specification language still to be determined.

@wookietreiber are you sure that your daylight saving time rule is correct for all years?
Please see this table: https://www.linker.ch/eigenlink/sommerzeit_winterzeit.htm

I agree, can we have a feature like this:
routes:
- match:
    severity: critical
  match_times:
    - time_range: '0900-1700'
      days: ['monday', 'tuesday', 'wednesday', 'thursday', 'friday']
      timezone: America/New_York
  receiver: team-X-pager

@GreenRover I'm not, I just copied from the others and thought good enough. That's kinda off topic here, please discuss at the gist.

Thank you all for the good discussions! I was still on the fence as to whether this should be part of Alertmanager or something managed externally (e.g. with Prometheus recording rules). Eventually I feel that not having it in Alertmanager isn't friendly for beginners and newcomers. Especially since it's something that "legacy" monitoring tools offer...

Going forward, it would be good to have someone starting a document so that we can gather requirements and design proposal. I agree with Julius that route configuration should probably be extended to specify when the alert group is active/inactive wrt time. Having a way to declare time windows upfront and reference them in the routing tree is also interesting.

Thanks @simonpasquier. Is there a traditional route for design/implementation of features like this? Perhaps we should start a new issue focused on the design and additionally ask the mailing list for comments. I think @juliusv's comments would serve as a good starting point.

We generally work on google docs for design documents, and they are posted on the developers mailing list.

Thanks @roidelapluie,

I'm happy to get started on a design doc. I'll post back here with a link when I've drafted it and inform the mailing list.

One way I’ve always thought about implementing it, is in the same form as ansible’s when conditions
It’s a simple list of string templates (ansible uses Jinja, but they could just be go templates) that are rendered to return either true or false (or default to false if they return a non Boolean value)
I believe this allows for the most flexible configuration, as users can specify as many conditions or nested conditions as they want, for instance:

when: date.hour < 14 and date.day == 0

Then it would be rendered using the following logic

{% if condition %}true{% else %}false{% endif %}

If it renders to true then the route applies

Sorry for the terrible formatting, I’m on my phone

Hi all, design doc is up and open for comments and suggestions here.

Assuming this is still in the planning stages?

I have another similar but different use case:

I want a rule to only start matching after some point in time X.

So that the rule looked like some_vector > 1234 AND time() > 1606780800.

This would effectively mute it til 2020-12-01T00:00:00Z.
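For anyone wanting to derive such a threshold rather than look it up, a small Python sketch follows. The helper name mute_until_epoch is made up; it only converts a UTC timestamp into the Unix seconds that time() is compared against.

```python
from datetime import datetime, timezone


def mute_until_epoch(iso_utc):
    """Convert 'YYYY-MM-DDTHH:MM:SSZ' (UTC) into Unix seconds.

    Hypothetical helper for building the `time() > ...` part of a rule.
    """
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


if __name__ == "__main__":
    threshold = mute_until_epoch("2020-12-01T00:00:00Z")
    # Build the expression from the comment above with the computed value.
    print(f"some_vector > 1234 and time() > {threshold}")
```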

Interesting use case

The time-based silencing spec is in limbo at the moment. I'm still working on it occasionally, but it's gone through a few iterations, finding consensus has proven difficult, and it's important we get it right. The current design, however, would not support this use case, as it's primarily designed to address silences that need to occur periodically with known time intervals.

My instinct would be to separate the query and use a silence to mute the alert until the specified time but I'm not sure if that would meet your requirements

Just to update on this issue:
Something myself and some of the maintainers would like to see is the ability to specify a timezone in your time interval configuration. Currently, this is basically impossible because it breaks cross platform support as Go doesn't use the OS timezone database on Windows, and being able to build/run on Windows was pointed out as a key requirement.

There is an issue for this here. A resolution on this issue would mean that we'd be able to reliably convert timezones on Windows. Alternatively, the recently released Go 1.15 allows us to embed the timezone database in the binary, but this both inflates the binary size and would require updates of Alertmanager to get new timezone definitions, which isn't ideal. Personally I'd like to see this support in Go before we can make a feature that's really useful (or comparable to the inhibition based approach).

@benridley
I know it's certainly not ideal, but what about something like using exec to call tzutil to get the time zone information if Alertmanager happens to be running on Windows? Until the support is added to Go itself, anyway.

@cpmoore Certainly a clever workaround, however tzutil would require timezones in Windows' format rather than the IANA format which Go and Linux use. At which point we'd need to keep a conversion table somewhere, and if we go that far we may as well implement the fix in upstream Go...

I think this is something that should be fixed upstream rather than us tackling it here or trying to work around it.

Perhaps an alternative would be to implement the feature with timezone support but leave it disabled. If people want to use it, they can use a build tag to build the application with tzdata bundled in and timezone support enabled. That would at least give people who want it an option, and make the transition easy when it's fixed in upstream.

Hi all, I've been chipping away at this. The feature is implemented in my development branch and supports the features outlined in the design document, so feel free to explore and give feedback. I created a small library to implement the syntax for the time intervals, but I plan to move that into the project rather than add a dependency.

It's currently failing some acceptance tests so I will work on that next. It also likely needs some sort of UI feedback for when routes are muted, as the current inhibition/silence mechanisms don't illustrate that.

@benridley great job! One thing that could be added is timezone management, like so:

 - name: business_hours
   intervals:
     - days: ['monday:friday']
       times: ['09:00-17:00']
       timezone: UTC

That would be helpful for managing summer time and winter time in Europe :). If not specified, the default timezone would be UTC, and it could be overridden.

Timezones are not supported in go on Windows.

Alertmanager uses Golang 1.14, but 1.15 has an option to embed the timezone data - does that work on Windows? See https://golang.org/doc/go1.15#time/tzdata

Not really, it was out of date when it was added and hasn't been updated since. Even if it was promptly updated, you could still easily be talking about a year for an update to propagate out, given Go and AM release cycles, which is far too long. And that ignores all the other problems with embedding data, such as being forced to upgrade.

Yeah proper timezone support will have to wait until Go parses the OS provided timezone files on Windows. There's an open issue for this in Go, so hopefully there's progress soon. We can always add the feature relatively easily as soon as support is added.

But isn't having some TZ support still better than no support whatsoever?
Timezones don't change frequently in the majority of regions.

If we go out today, it will be out of sync for Europe in about 6 months.

@roidelapluie is a change in timezones for the whole Europe scheduled next April?

It seems like they moved it to 2022, but yes, it should be the end of DST here.

There are countries where you often get zero notice of a change, and more generally timezone changes happen more frequently than you'd think. Canada is in the middle of one, for example (the relevant law hadn't passed yet, but they were planning on it last I looked).

Pull request here: #2393

Glad to see there's a PR open for this feature, although it seems it's only for muting alerts between specific time periods. Are there any updates with regards to allowing for different alerting routes depending on the day of week and/or time? Something like what was described in this comment would be great.

Hi @hartfordfive, this was initially discussed in the design draft, but changing routing between time periods was determined to be too problematic because there's a lot of important behaviour tied to routes. For example, when time intervals change, should a flurry of new alerts fire to the newly active route? Should a flurry of resolved alerts be sent to the old one?

For this reason, routing remains static and muting is applied to routes in the above design. However, you should be able to achieve most of the same outcomes by muting routes and using the continue option to specify alternate paths for the same alert.
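As a rough illustration of that suggestion, the fragment below sketches two sibling routes that match the same alert, each muted during the other's active window, with `continue: true` on the first so the alert is also evaluated against the second. It assumes the `mute_time_intervals` syntax from the pull request; the interval names, the receiver names (`default_receiver`, `slack_appteam`, `pagerduty_appteam`) and the `MyAlert` matcher are placeholders, the `receivers` section is omitted, and all times are UTC.

```yaml
# Sketch only: names are placeholders and the receivers section is omitted.
mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
  - name: off_hours
    time_intervals:
      # Weekends, all day.
      - weekdays: ['saturday', 'sunday']
      # Weekdays before 09:00 and from 17:00 onwards.
      - weekdays: ['monday:friday']
        times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'

route:
  receiver: default_receiver
  routes:
    # Daytime path: muted outside business hours.
    - match:
        alertname: MyAlert
      receiver: slack_appteam
      mute_time_intervals:
        - off_hours
      continue: true   # keep evaluating sibling routes
    # Night/weekend path: muted during business hours.
    - match:
        alertname: MyAlert
      receiver: pagerduty_appteam
      mute_time_intervals:
        - business_hours
```

Routing itself stays static here; only the notifications from each route are muted, which is the trade-off described above.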

Hi,

I was wondering if anyone can help me. I followed this post and the design document and, if I understood correctly, I should set the below in alertmanager.yml:
mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: "09:00"
            end_time: "17:00"

I started with this to test; however, every time I restart Alertmanager, I get the error below:

msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n line 1: field mute_time_intervals not found in type config.plain

Am I missing something?
Following this doc : https://docs.google.com/document/d/1pf-rPDQUGJUHazyr5vanTO6ft3loNZO9UoVpvhShFtA/edit

Hi @justin27c,
this feature is not part of any official release yet. You need to compile Alertmanager yourself with PR #2393, and I don't know how stable it is.

Silencing between hours 0 and 6 (local time):

# hour() is evaluated in UTC by default
# my local time is CST (UTC+8)
node_load5 > 8 and ON() (hour() < 16 or hour() > 22)
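To spell out the offset arithmetic, here is a minimal sketch of that expression inside a Prometheus rule file. The group name, alert name and `severity` label are assumptions added for illustration; only the expression comes from the comment above.

```yaml
groups:
  - name: example-time-gated-rules          # hypothetical group name
    rules:
      - alert: HighLoadOutsideQuietHours    # hypothetical alert name
        # hour() yields the current UTC hour (0-23).
        # 00:00-06:59 at UTC+8 is 16:00-22:59 UTC, so the right-hand side
        # is empty for UTC hours 16..22 and the alert cannot fire then.
        expr: node_load5 > 8 and on() (hour() < 16 or hour() > 22)
        labels:
          severity: warning                 # assumed label
```

Note that with this approach a firing alert resolves when the quiet window starts and fires again when it ends, rather than staying active and merely being muted.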

Implemented in #2393

Hi all.
I'm trying to use that feature with release version 0.22.0-rc.1, but I have no luck with it.
Here is my alertmanager.yml:

#
# Ansible managed
#

global:
  resolve_timeout: 3m

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: "09:00"
          end_time: "11:00"

templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: pagerduty
  pagerduty_configs:
  - client_url: http://х.х.х.х:9093/
    description: '{{ if .CommonAnnotations.summary }}{{ .CommonAnnotations.summary
      }}{{ end }}'
    routing_key: ххх
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower
      }}{{ end }}'

route:
  group_by:
  - alertname
  - cluster
  - service
  - env
  group_interval: 5m
  group_wait: 30s
  receiver: pagerduty
  repeat_interval: 4h
  routes:
  - group_wait: 10s
    match:
      severity: loww
      time_intervals:
        - business_hours
      receiver: pagerduty

But with that config, Alertmanager won't start.
Part of log:

level=error ts=2021-05-14T13:38:06.667Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n  line 44: cannot unmarshal !!seq into string"

Can anyone help me?
Thanks

@bimmerkiev - can you paste the whole log file to check which is line 44?

@justin27c sure. fixed.
It's about
" - business_hours"

Can you change the below?

     time_intervals:
        - business_hours
      receiver: pagerduty
to 
mute_time_intervals:
     - business_hours
  receiver: pagerduty

Already tried that - no luck. The same error

If I try this:

  routes:
  - group_wait: 10s
    match:
      severity: loww
    mute_time_intervals:
        - business_hours
      receiver: pagerduty

I'm receiving:

level=error ts=2021-05-14T13:59:24.090Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: line 43: did not find expected key"

If like this:

  routes:
  - group_wait: 10s
    match:
      severity: loww
      mute_time_intervals:
        - business_hours
      receiver: pagerduty

The result:

level=error ts=2021-05-14T14:00:06.573Z caller=coordinator.go:118 component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n  line 43: cannot unmarshal !!seq into string"

I have the below config and working fine:

mute_time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
        - start_time: '18:00'
          end_time: '22:00'

route:
  group_by: ['alertname']
  group_wait: 30s
  routes:
  - match:
      severity: low
    receiver: team-receiver
    mute_time_intervals:
      - business_hours

Hi @bimmerkiev, you were almost there with this version:

  routes:
  - group_wait: 10s
    match:
      severity: loww
    mute_time_intervals:
        - business_hours
      receiver: pagerduty

The problem is that receiver was indented too far, so Alertmanager was getting confused: it should sit at the same level as mute_time_intervals, group_wait, etc. This config should be OK:

  routes:
  - group_wait: 10s
    match:
      severity: loww
    mute_time_intervals:
      - business_hours
    receiver: pagerduty

I really appreciate your help.
It works now.

Thank you all for your ideas. This is the expression I came up with to trigger the alert only during working hours on weekdays: vector(1) and on() (day_of_week() > 0 and day_of_week() < 6) and on() (hour() > 8 and hour() < 18), where 'vector(1)' is the query for the metric I'm using.
Hope it's useful to you. Best regards
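For anyone wanting to try this, a minimal sketch of the expression in a Prometheus rule file might look like the following. The group name, alert name and `severity` label are hypothetical, and `vector(1)` should be replaced with the real query.

```yaml
groups:
  - name: working-hours-example             # hypothetical group name
    rules:
      - alert: ExampleWorkingHoursAlert     # hypothetical alert name
        # day_of_week() is 0 for Sunday and 6 for Saturday, so 1..5 is Monday-Friday.
        # hour() is in UTC; "> 8 and < 18" keeps hours 9..17, i.e. 09:00-17:59 UTC.
        expr: |
          vector(1)
          and on() (day_of_week() > 0 and day_of_week() < 6)
          and on() (hour() > 8 and hour() < 18)
        labels:
          severity: warning                 # assumed label
```

Both time functions work in UTC, so the hour bounds need shifting by hand for other timezones and will not follow daylight saving changes.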