medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT

Change alerts to percentage increase: "Outbound Push Backlog" and "Users Over Replication Limit"

mrjones-plip opened this issue · comments

Right now Outbound Push Backlog and Users Over Replication Limit are tied to a fixed integer, say "is greater than 500". This is not helpful. For example, the bulk of the Outbound alerts are from one deployment that has had a 0% change in its 214 outbound push docs over the past 30 days, so we shouldn't alert.

Instead we should change these to alert when the amount increases by more than X% over Y days (research needed to see what values are logical).
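For illustration only (the 20% / 7-day values are placeholders, not proposed numbers), a percentage-increase check of that shape could compare the current gauge value against its value Y days ago:

```
# Hypothetical shape of an "increased by more than X% over Y days" alert:
# fire if the outbound push backlog is more than 20% higher than it was 7 days ago.
cht_outbound_push_backlog_count > 1.20 * (cht_outbound_push_backlog_count offset 7d)
```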

Link to sample alert cited above

Just noting here that we should be able to use the deriv function on these gauge metrics to calculate the rate of change per second over a specified range of time. This would allow us to alert when these metric values climb rapidly and do not fall. However, it will not alert if the metric value stays at a constant high value for an extended period of time. (So basically, alerts would be triggered only by changes to a metric value.)
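For example (the 100 docs/hour threshold here is just a placeholder to show the shape of the query, not a proposed value):

```
# deriv() fits a per-second slope to the gauge over the 1h window; * 60 * 60 converts
# that to docs per hour. A backlog that is large but flat has a slope near zero, so this
# expression is only true while the backlog is actively growing.
deriv(cht_outbound_push_backlog_count[1h]) * 60 * 60 > 100
```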

I am going to do some more research into which alerts should be updated for this (besides just Outbound Push Backlog and Users Over Replication Limit) as well as which values we should alert on...

Another thing I have been considering here is whether we need to include another dimension in the calculation to account for drastic size differences between instances. So, maybe the threshold for the amount of change per second that should trigger the alert should be some value based on the size/activity level of the instance. For example, an instance with 20,000 users probably needs to alert at a higher rate of Users Over Replication Limit than an instance with 20 users.

Just want to record here my thinking on the algorithms for the various alerts. Basically, I am going to keep these as simple as possible while still focusing on rate of change rather than just simple levels. The threshold for triggering each alert is also based on data from the instance, so the same alert rule should work for both large and small instances. Finally, all the alert thresholds include a numerical buffer value to eliminate noise coming from very small instances. (For example, if an instance only has 5 users and 1 of them goes over the replication limit, that is 20% of the users, but probably not worth alerting on...) A sketch of how one of these could be wired up as an actual alerting rule follows the list.

  • Sentinel Backlog:
    • backlog_hr = deriv(cht_sentinel_backlog_count [1h]) * 60 * 60
      • Number of docs added to backlog in last hour
    • changes_hr = rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60
      • Average rate of changes to medic db per hour (over last month)
    • Alert = $backlog_hr > ($changes_hr + 500)
      • Alert if the backlog is growing at the same rate that changes are typically coming in (+ 500). This indicates that basically nothing is getting processed by Sentinel. (The 500 is just there as a minimum buffer.)
    • Explore Query = (deriv(cht_sentinel_backlog_count [1h]) * 60 * 60) > on(instance) (rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60 + 500)
  • Outbound Push Backlog:
    • changes_hr = rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60
      • Average rate of changes to medic db per hour (over last month)
    • backlog_hr = deriv(cht_outbound_push_backlog_count [1h]) * 60 * 60
    • Alert = $backlog_hr > ($changes_hr * 0.05 + 5)
      • Alert if backlog is increasing by more than 5% of the changes rate (with a minimum buffer of 5 changes).
    • Explore Query = (deriv(cht_outbound_push_backlog_count [1h]) * 60 * 60) > on(instance) ((rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60) * 0.05 + 5)
  • Users Over Replication Limit:
    • users_day = deriv(cht_replication_limit_count [1d]) * 60 * 60 * 24
    • users = cht_connected_users_count
    • Alert = $users_day > ($users * 0.003 + 2)
      • Alert if the number of users over the replication limit is increasing by more than 0.3% of connected users (+ 2) per day
    • Explore Query = (deriv(cht_replication_limit_count [1d]) * 60 * 60 * 24) > on(instance) (cht_connected_users_count * 0.003 + 2)
  • DB Conflicts Rate:
    • conflicts_hr = deriv(cht_conflict_count [1h]) * 60 * 60
    • changes_hr = rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60
    • Alert = $conflicts_hr > ($changes_hr * 0.25 + 10)
    • Explore Query = (deriv(cht_conflict_count [1h]) * 60 * 60) > on(instance) ((rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60) * 0.25 + 10)
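For reference, here is a rough sketch of how one of the expressions above could be wired up as a Prometheus-style alerting rule. The group name, rule name, `for` duration, labels, and annotation text are all placeholders (and the exact format will differ if these are managed as Grafana alerts rather than Prometheus rule files); only the `expr` comes from the Sentinel Backlog item above:

```
groups:
  - name: cht_backlog_alerts
    rules:
      - alert: SentinelBacklogGrowing
        # Fires when the Sentinel backlog grows about as fast as changes are arriving,
        # i.e. Sentinel is effectively not processing anything (500 docs/hr buffer).
        expr: >
          (deriv(cht_sentinel_backlog_count[1h]) * 60 * 60)
          > on(instance)
          (rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60 + 500)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Sentinel backlog is growing as fast as changes are coming in"
```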

Love it - thanks for the update @jkuester!

🎉 This issue has been resolved in version 1.8.0 🎉

The release is available on the GitHub release page.

Your semantic-release bot 📦🚀