Change alerts to percentage increase: "Outbound Push Backlog" and "Users Over Replication Limit"
mrjones-plip opened this issue
Right now "Outbound Push Backlog" and "Users Over Replication Limit" are tied to a fixed integer threshold, say "is greater than 500". This is not helpful. For example, the bulk of the Outbound alerts come from one deployment that has had a 0% change in the 214 docs in outbound push over the past 30 days, so we shouldn't alert.
Instead, we should change these to alert when the amount increases by more than X% over Y days (research needed to see which values are logical).
Link to sample alert cited above
Just noting here that we should be able to use the `deriv` function on these gauge metrics to calculate the per-second rate of change of the metrics over a specified range of time. This would allow us to alert when these metric values climb rapidly and do not fall. However, it will not alert if the metric value stays at a constant high value for an extended period of time. (So basically, alerts would be triggered only by changes to a metric value.)
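To make the behaviour described above concrete, here is a minimal Python sketch of a least-squares slope, which is the kind of calculation PromQL's `deriv` performs on a range of gauge samples. The helper name `deriv` and the sample data are illustrative assumptions, not code from this repository.

```python
def deriv(samples):
    """Least-squares slope of (timestamp_seconds, value) pairs, in units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


# Hypothetical one-hour windows, sampled once a minute.
flat_high = [(60 * i, 214) for i in range(61)]         # stuck at 214 docs
climbing = [(60 * i, 214 + 2 * i) for i in range(61)]  # gains 2 docs per minute

# Multiply by 60 * 60 to convert a per-second slope into a per-hour figure.
flat_per_hour = deriv(flat_high) * 60 * 60
climbing_per_hour = deriv(climbing) * 60 * 60
print(flat_per_hour, climbing_per_hour)
```

The flat series yields a slope of zero even though its value is high, while the climbing series yields about 120 docs per hour, so only the second would trip a rate-of-change alert.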
I am going to do some more research into which alerts should be updated for this (besides just "Outbound Push Backlog" and "Users Over Replication Limit") as well as which values we should alert on...
Another thing I have been considering here is whether we need to include another dimension in the calculation to account for drastic size differences between instances. So, maybe the threshold for the amount of change per sec that should trigger the alert should be some value based on the size/activity-level of the instance. For example, an instance with 20,000 users probably needs to alert at a higher rate of Users Over Replication Limit than an instance with 20 users.
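As a rough illustration of a size-aware threshold, the sketch below scales the alert level with the number of users and adds a fixed floor. The function name `scaled_threshold` is hypothetical. The 0.3% fraction and the buffer of 2 are borrowed from the Users Over Replication Limit rule proposed later in this thread and are used here only as example values.

```python
def scaled_threshold(instance_size, fraction, buffer):
    """Alert level that grows with instance size, with a fixed minimum buffer."""
    return instance_size * fraction + buffer


FRACTION = 0.003  # 0.3% of users (example value)
BUFFER = 2        # floor that keeps very small instances quiet

small = scaled_threshold(20, FRACTION, BUFFER)       # 20-user instance
large = scaled_threshold(20_000, FRACTION, BUFFER)   # 20,000-user instance
print(small, large)
```

With these values the 20-user instance alerts at a little over 2 new users per day and the 20,000-user instance at about 62, so one rule can serve both.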
Just want to record here my thinking on the algorithms for the various alerts. Basically, I am going to keep these as simple as possible while still focusing on rate of change and not just simple levels. The threshold for triggering each alert is also based on data from the instance, so the same alert rule should work for both large and small instances. Finally, all the alert thresholds include a numerical buffer value to eliminate noise coming from very small instances. (For example, if an instance only has 5 users and 1 of them goes over the replication limit, that is 20% of the users, but probably not worth alerting on...)
- Sentinel Backlog:
  - backlog_hr = `deriv(cht_sentinel_backlog_count[1h]) * 60 * 60`
    - Number of docs added to backlog in last hour
  - changes_hr = `rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60`
    - Average rate of changes to `medic` db per hour (over last month)
  - Alert = `$backlog_hr > ($changes_hr + 500)`
    - Alert if the backlog is growing at the same rate as changes are typically coming in (`+ 500`). This indicates that basically nothing is getting processed by Sentinel. (The `500` is just there as a minimum buffer.)
  - Explore Query = `(deriv(cht_sentinel_backlog_count[1h]) * 60 * 60) > on(instance) (rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60 + 500)`
- Outbound Push Backlog:
  - changes_hr = `rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60`
    - Average rate of changes to `medic` db per hour (over last month)
  - backlog_hr = `deriv(cht_outbound_push_backlog_count[1h]) * 60 * 60`
  - Alert = `$backlog_hr > ($changes_hr * 0.05 + 5)`
    - Alert if the backlog is increasing by more than 5% of the changes rate (with a `5` change minimum buffer).
  - Explore Query = `(deriv(cht_outbound_push_backlog_count[1h]) * 60 * 60) > on(instance) ((rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60) * 0.05 + 5)`
- Users Over Replication Limit:
  - users_day = `deriv(cht_replication_limit_count[1d]) * 60 * 60 * 24`
  - users = `cht_connected_users_count`
  - Alert = `$users_day > (cht_connected_users_count * 0.003 + 2)`
    - Alert if the number of users over the replication limit is increasing by more than 0.3% of total users (`+ 2`) per day
  - Explore Query = `(deriv(cht_replication_limit_count[1d]) * 60 * 60 * 24) > on(instance) (cht_connected_users_count * 0.003 + 2)`
- DB Conflicts Rate:
  - conflicts_hr = `deriv(cht_conflict_count[1h]) * 60 * 60`
  - changes_hr = `rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60`
  - Alert = `$conflicts_hr > ($changes_hr * 0.25 + 10)`
  - Explore Query = `(deriv(cht_conflict_count[1h]) * 60 * 60) > on(instance) ((rate(cht_couchdb_update_sequence{db="medic"}[30d]) * 60 * 60) * 0.25 + 10)`
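The four rules above share one shape: growth > baseline * fraction + buffer. The Python sketch below restates them that way so the arithmetic can be checked by hand. The `Rule` class, the `should_alert` helper, and the baseline of 1,000 changes per hour are assumptions made for illustration. Only the fractions and buffers come from the rules listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    fraction: float  # share of the baseline that growth may reach
    buffer: float    # fixed floor to silence noise on small instances


# Fractions and buffers taken from the rules listed above.
RULES = {
    "sentinel_backlog": Rule(fraction=1.0, buffer=500),   # per hour vs changes_hr
    "outbound_push_backlog": Rule(fraction=0.05, buffer=5),
    "users_over_replication_limit": Rule(fraction=0.003, buffer=2),  # per day vs users
    "db_conflicts": Rule(fraction=0.25, buffer=10),
}


def should_alert(rule_name, growth, baseline):
    """True when growth exceeds baseline * fraction + buffer for the named rule."""
    rule = RULES[rule_name]
    return growth > baseline * rule.fraction + rule.buffer


# Hypothetical instance averaging 1,000 changes per hour.
changes_hr = 1000

# A backlog sitting flat at 214 docs has zero growth, so it stays quiet.
flat = should_alert("outbound_push_backlog", growth=0, baseline=changes_hr)

# A backlog growing by 80 docs per hour exceeds 1000 * 0.05 + 5 = 55.
growing = should_alert("outbound_push_backlog", growth=80, baseline=changes_hr)
print(flat, growing)
```

Under these assumptions, the deployment from the original report, flat at 214 docs for 30 days, would no longer alert, while a backlog that keeps climbing would.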
🎉 This issue has been resolved in version 1.8.0 🎉
The release is available on GitHub release
Your semantic-release bot 📦🚀