medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT

Gather metrics and configure alerts to find out when installed CHT version is out of date

garethbowen opened this issue

We have an OGSM metric aimed at reducing the number of instances running versions of the Core Framework that are no longer supported according to the CHT support matrix. One possible reason people don't upgrade is that they don't know an upgrade is available. Other than loading the Admin > Upgrade page, or watching the Forum, it's not that easy to find out.

A couple of measures I can think of are...

  • Alert when a new service pack is released for the current major + minor, i.e. your version has bugs, fix these by upgrading today!
  • Metric and alert when the currently installed version is no longer supported.

Both of these will need a new data ingress point, either to the market, or the docs site somehow.
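
As a rough illustration of the two checks above, here is a minimal Python sketch. It assumes the list of released versions and the set of supported major.minor lines have already been fetched from whatever ingress point is chosen. The function names and version numbers are hypothetical, not part of Watchdog or CHT.

```python
from typing import Iterable, Optional, Set, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a plain 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return (major, minor, patch)


def newer_patch(installed: str, released: Iterable[str]) -> Optional[str]:
    """Return the highest release on the installed major.minor line
    that is newer than the installed version, or None if there is none."""
    current = parse_version(installed)
    same_line = [
        version
        for version in (parse_version(r) for r in released)
        if version[:2] == current[:2] and version > current
    ]
    if not same_line:
        return None
    return ".".join(str(n) for n in max(same_line))


def is_supported(installed: str, supported_lines: Set[Tuple[int, int]]) -> bool:
    """True if the installed major.minor is in the supported set."""
    return parse_version(installed)[:2] in supported_lines


# Made-up data standing in for the new data ingress point.
released = ["4.1.0", "4.1.1", "4.1.2", "4.2.0"]
supported_lines = {(4, 1), (4, 2)}

print(newer_patch("4.1.0", released))          # a newer service pack exists
print(is_supported("3.9.0", supported_lines))  # unsupported line
```

The first function would feed the "new service pack" alert and the second the "no longer supported" metric. Pre-release tags such as `-beta.1` are not handled here.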

This is an interesting idea - thanks for the submission @garethbowen!

tl;dr - We have too much alert noise right now and not enough signal. I don't think we should alert on this (yet?). Let's see how we can highlight it in the short term and reconsider in the long term.


We've been in the process of auditing (see #35) the existing alerts (list is here; tl;dr - there are 9 currently, counting one that hasn't been created yet and one that will likely be removed). To get a better sense of how important and actionable the alerts are, we enabled all 9 alerts on the 30+ production CHT instances that Medic hosts. So far, of all the alerts we've gotten, truth be told, only two or three have been actionable.

Also, we already have a feed of the releases we post to the forum, which shows up in this handy panel on every Watchdog instance:

image

Maybe this is enough?

But, short term, no new alerts. Only alerts for problems that are going to cause outages. And only ones that are actionable. At a later date, when we have a much better signal-to-noise ratio, we might consider adding a panel that shows both how many versions your current one is behind the latest release, and how many versions it is behind the oldest release that is not yet EOL.
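
For what it's worth, the two numbers such a panel would show could be computed roughly like this. This is only a sketch under assumptions: the release list and the oldest non-EOL version are taken as given, the function names are hypothetical, and the version numbers are invented.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a plain 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return (major, minor, patch)


def versions_behind_latest(installed: str, released: Iterable[str]) -> int:
    """Count distinct releases that are newer than the installed version."""
    current = parse_version(installed)
    return sum(1 for v in {parse_version(r) for r in released} if v > current)


def versions_behind_supported(
    installed: str, released: Iterable[str], oldest_supported: str
) -> int:
    """Count distinct releases newer than the installed version, up to and
    including the oldest non-EOL release. Returns 0 if the installed
    version is already at or past that release."""
    current = parse_version(installed)
    floor = parse_version(oldest_supported)
    return sum(
        1 for v in {parse_version(r) for r in released} if current < v <= floor
    )


# Invented release history; "4.1.0" plays the oldest non-EOL release.
released = ["3.9.0", "4.0.0", "4.1.0", "4.2.0", "4.3.0"]

print(versions_behind_latest("3.9.0", released))
print(versions_behind_supported("3.9.0", released, "4.1.0"))
```

A non-zero result from the second function would mean the instance is already on an EOL version.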