Expand base set of alert metrics and determine priority levels

Question

eljhkrr opened this issue 3 months ago · comments

The existing instance alert rules have been compiled here: https://docs.google.com/spreadsheets/d/1-sq1Bfz-8i3TyVn9rNcYzJKxcy4wreySUC3YAQH3nkA/edit#gid=0
These rules were built from the audit data analysis in #35:
https://docs.google.com/spreadsheets/d/1ZAHqPidHckvfQUoGdcE2AlPjiduC3A0kyUDNCBYTltw/edit#gid=0
To make the alert system more usable, the alert rules need to be reviewed against current deployment needs.

mrjones · Answer 1 · Fri Apr 26 2024 22:41:41 GMT+0800 (China Standard Time)

@eljhkrr - we're now adding couch2pg backlog. Is this a good place to capture the need to alert on that value? Current thinking is that we should alert if it increases over a 24-hour period. Couch2pg runs every 6 hours on most medic prod instances.
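
A minimal sketch of the proposed rule, assuming backlog samples are available as (timestamp, value) pairs. The function name and data shape are hypothetical and not taken from watchdog or couch2pg. It compares the oldest and newest samples inside a 24-hour window, so dips between the 6-hourly couch2pg runs do not affect the result:

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical sample shape: (observation time, backlog size).
Sample = Tuple[datetime, int]


def backlog_increased(
    samples: List[Sample],
    now: datetime,
    window: timedelta = timedelta(hours=24),
) -> bool:
    """Return True if the newest backlog in the window exceeds the oldest one."""
    # Keep only samples inside [now - window, now], ordered by time.
    in_window = sorted(s for s in samples if now - window <= s[0] <= now)
    if len(in_window) < 2:
        return False  # not enough data to compare, so do not alert
    return in_window[-1][1] > in_window[0][1]


# Example data: one sample every 6 hours, matching the couch2pg schedule.
now = datetime(2024, 4, 26, 12, 0)
growing = [(now - timedelta(hours=h), b)
           for h, b in [(24, 100), (18, 40), (12, 150), (6, 90), (0, 220)]]
draining = [(now - timedelta(hours=h), b)
            for h, b in [(24, 500), (18, 300), (12, 350), (6, 120), (0, 80)]]

print(backlog_increased(growing, now))   # True
print(backlog_increased(draining, now))  # False
```

Comparing only the window's endpoints is one possible reading of "increases over a 24 hour period"; a stricter rule could require every sample to be higher than the previous one.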

Elijah Karari · Answer 2 · Sat Apr 27 2024 05:03:38 GMT+0800 (China Standard Time)

Thanks @mrjones-plip, this is the right place for alert metric proposals. I've added it to the document to make it easier to review as a batch.

Elijah Karari · Answer 3 · Sat May 11 2024 04:18:10 GMT+0800 (China Standard Time)

More alert proposals for p0-p3 metrics have been added to the doc for consideration.

mrjones · Answer 4 · Sat May 11 2024 06:07:19 GMT+0800 (China Standard Time)

Thanks for the update, @eljhkrr!

Any alerts that depend on content from the ingest data repo should be added there instead of in watchdog. We can do another bind mount of the alert config, like we do for watchdog.

Elijah Karari · Answer 5 · Sat May 11 2024 06:21:12 GMT+0800 (China Standard Time)

Thanks @mrjones-plip, I'll keep that in mind when making updates.