medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT

Expand base set of alert metrics and determine priority levels

eljhkrr opened this issue

Existing instance alert rules have been compiled here: https://docs.google.com/spreadsheets/d/1-sq1Bfz-8i3TyVn9rNcYzJKxcy4wreySUC3YAQH3nkA/edit#gid=0
These rules were built from the audit data analysis in #35:
https://docs.google.com/spreadsheets/d/1ZAHqPidHckvfQUoGdcE2AlPjiduC3A0kyUDNCBYTltw/edit#gid=0
To make the alert system more usable, the alert rules need to be reviewed against current deployment needs.

@eljhkrr - we're now adding couch2pg backlog. Is this a good place to capture the need to alert on that value? Current thinking is that we should alert if it increases over a 24-hour period. Couch2pg runs every 6 hours on most medic prod instances.
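
To make the proposed 24-hour check concrete, here is a minimal Python sketch. It is only an illustration of the logic, not how watchdog defines alerts; the function name `backlog_increased` and the `(timestamp, backlog)` sample format are hypothetical. It assumes samples arrive roughly every 6 hours, matching the couch2pg schedule mentioned above, and compares the newest backlog value with the oldest one inside the window.

```python
from datetime import datetime, timedelta


def backlog_increased(samples, window=timedelta(hours=24)):
    """Return True if the backlog is higher now than at the start of the window.

    `samples` is a hypothetical list of (datetime, int) tuples sorted by
    time in ascending order, e.g. one reading per couch2pg run.
    """
    if not samples:
        return False

    latest_time, latest_value = samples[-1]
    cutoff = latest_time - window

    # Keep only the readings that fall inside the look-back window.
    in_window = [(t, v) for t, v in samples if t >= cutoff]
    if len(in_window) < 2:
        # A single reading cannot show a trend.
        return False

    _, earliest_value = in_window[0]
    return latest_value > earliest_value


# Example: five readings, one every 6 hours, with a steadily growing backlog.
start = datetime(2023, 1, 1)
growing = [(start + timedelta(hours=6 * i), 100 + 10 * i) for i in range(5)]
shrinking = [(start + timedelta(hours=6 * i), 100 - 10 * i) for i in range(5)]
```

Comparing only the first and last readings keeps the check simple, but a temporary spike between couch2pg runs would not trigger it. Whether that is acceptable is one of the things to settle when reviewing the proposal.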

Thanks @mrjones-plip, this is the right place for alert metric proposals. I've added it to the document to make it easier to review as a batch.

More alert proposals for p0-p3 metrics have been added to the doc for consideration.

Thanks for the update, @eljhkrr!

Any alerts that depend on content from the ingest data repo should be added there instead of in watchdog. We can do another bind mount of the alert config like we do for watchdog.

Thanks @mrjones-plip, will keep that in mind when making updates.