Code claim ratio anomaly detection should factor in overall issue volume

Question

bschlaman opened this issue 3 years ago

TL;DR

Currently, a realm's code claim ratio for a given UTC day is considered anomalous if it strays >=2 standard deviations from a rolling 14-day mean. Realms with low overall issue rates have a naturally greater degree of statistical uncertainty in their daily claim ratios, and are likely to trigger false positives in the codes_claimed_ratio_anomaly metric.
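For illustration, here is a minimal sketch of the rule described above. It is not the server's actual implementation: the function name `is_anomalous`, the two-sided comparison, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def is_anomalous(ratio_history, today_ratio, num_stdevs=2.0):
    """Flag today's claim ratio if it is at least `num_stdevs` standard
    deviations away from the mean of the trailing window (either direction)."""
    mu = mean(ratio_history)
    sigma = pstdev(ratio_history)  # assumption: population standard deviation
    if sigma == 0:
        # Degenerate window: every day had the same ratio, so any change stands out.
        return today_ratio != mu
    return abs(today_ratio - mu) >= num_stdevs * sigma


# Hypothetical low-volume realm issuing about 4 codes per day: the daily ratio
# can only move in steps of 0.25, so it is inherently noisy.
low_volume_ratios = [0.75, 0.5, 1.0, 0.75, 0.75, 0.5, 0.75,
                     1.0, 0.75, 0.75, 0.5, 0.75, 1.0, 0.75]

print(is_anomalous(low_volume_ratios, 0.25))  # 1 of 4 codes claimed
print(is_anomalous(low_volume_ratios, 0.5))   # 2 of 4 codes claimed
```

In this made-up window, a single day on which only one of four codes is claimed is enough to cross the 2-standard-deviation threshold, which is the kind of false positive a low-issue realm is prone to.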

Additionally, I think it's sensible to omit the default e2e test realm from this metric, as it has a comparatively high issue rate, but will trigger the alert typically with even a single failure, as near 100% success is expected. While it is important to be alerted on this, this behavior is a duplicate of the ForwardProgress-e2e-default alert.

Design

Proposal
I propose there be an inverse relationship between the average code issue rate of a realm and how many consecutive anomaly days are required to trigger the alert. (Perhaps the scalar constant can be a configurable value? Or set dynamically based on the average realm issue rate across the server?) To simplify, maybe each realm would fall into 1 of 2 categories - high and low issue realms.

The downside here would be a potential delay in alerting low-issue realms to an outage.
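A rough sketch of what the proposal could look like, assuming a simple inverse mapping from average daily issue volume to the number of consecutive anomalous days required. The function names, the `scale` constant, and the `min_days`/`max_days` bounds are hypothetical placeholders, not values taken from the server.

```python
import math


def required_consecutive_days(avg_daily_issued, scale=50.0, min_days=1, max_days=5):
    """Fewer issued codes per day means more consecutive anomalous days are
    required before alerting. All constants here are illustrative only."""
    if avg_daily_issued <= 0:
        return max_days
    days = math.ceil(scale / avg_daily_issued)  # inverse relationship
    return max(min_days, min(max_days, days))


def should_alert(anomaly_flags, required_days):
    """`anomaly_flags` holds one bool per UTC day, oldest first. Alert only if
    the most recent `required_days` days were all anomalous."""
    if len(anomaly_flags) < required_days:
        return False
    return all(anomaly_flags[-required_days:])


# A high-volume realm alerts on the first anomalous day; a low-volume realm
# has to see several anomalous days in a row.
print(required_consecutive_days(1000))
print(required_consecutive_days(20))
print(should_alert([False, True, True], required_consecutive_days(1000)))
print(should_alert([False, True, True], required_consecutive_days(20)))
```

The two-category simplification mentioned above would amount to replacing the `scale / avg_daily_issued` expression with a single cutoff. The clamp to `max_days` bounds the alerting delay for very quiet realms.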

Alternatives considered
Alternatively, you could consider creating an inverse relationship between average code issue rate and number of standard deviations required to increment the metric; e.g. for a low-issue realm, it may require 3 standard deviations instead of the default 2. This is probably much easier to code but may not eliminate all false positives.
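This alternative could be sketched as follows, using the 2-versus-3 standard deviation example from the paragraph above. The `low_volume_cutoff` of 100 codes per day and both function names are assumptions chosen only to make the example concrete.

```python
from statistics import mean, pstdev


def stdev_threshold(avg_daily_issued, low_volume_cutoff=100.0,
                    default_stdevs=2.0, low_volume_stdevs=3.0):
    """Pick a wider band for low-issue realms. The cutoff is hypothetical."""
    if avg_daily_issued < low_volume_cutoff:
        return low_volume_stdevs
    return default_stdevs


def is_anomalous_for_realm(ratio_history, issued_history, today_ratio):
    """Two-sided check against a threshold that depends on issue volume."""
    k = stdev_threshold(mean(issued_history))
    mu = mean(ratio_history)
    sigma = pstdev(ratio_history)
    return sigma > 0 and abs(today_ratio - mu) >= k * sigma


ratios = [0.75, 0.5, 1.0, 0.75, 0.75, 0.5, 0.75,
          1.0, 0.75, 0.75, 0.5, 0.75, 1.0, 0.75]

# The same ratio history is judged differently depending on issue volume.
print(is_anomalous_for_realm(ratios, [4] * 14, 0.35))    # low-issue realm
print(is_anomalous_for_realm(ratios, [400] * 14, 0.35))  # high-issue realm
```

As noted above, a wider band only reduces false positives: a sufficiently extreme day in a low-issue realm still crosses 3 standard deviations.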

Brendan Schlaman · Answer 1 · Tue Nov 16 2021 03:41:57 GMT+0800 (China Standard Time)

@sethvargo

Seth Vargo · Answer 2 · Tue Nov 16 2021 07:19:51 GMT+0800 (China Standard Time)

> Additionally, I think it's sensible to omit the default e2e test realm from this metric, as it has a comparatively high issue rate, but will trigger the alert typically with even a single failure, as near 100% success is expected. While it is important to be alerted on this, this behavior is a duplicate of the ForwardProgress-e2e-default alert.

This is done and will be included in the next release (#2284).

Seth Vargo · Answer 3 · Tue Nov 16 2021 07:28:31 GMT+0800 (China Standard Time)

Two other alternatives (open to ideas):

1. Allow per-realm configurable thresholds. This adds additional overhead to realm admins to configure their desired threshold (much like the abuse factor), but it also gives realms that want to be alerted on the slightest anomalies the ability to configure that as such.
2. Change the calculation such that it's over total codes claimed / codes_issued for the 14-day window (instead of per day). Right now, the calculation is:
```
Mean = Sum(ratio.day1, ratio.day2, ratio.day3, ...) / N
```
Where ratio.dayN is the codes_claimed/codes_issued for that UTC date. If, instead, we calculated it as:
```
Mean = [SUM(codes_claimed.day1, codes_claimed.day2, ...) / SUM(codes_issued.day1, codes_issued.day2, ...)]/N
```
then there's far less variance, but it's also far less likely to alert. We could also meet somewhere in the middle with:
```
Mean = Mean(codes_claimed.day1, codes_claimed.day2, ...) / Mean(codes_issued.day1, codes_issued.day2, ...)
```
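To make the difference between these formulas concrete, here is a small sketch on made-up numbers. The function names are hypothetical, and skipping days with zero issued codes in the per-day version is an assumption. One thing the sketch makes visible: the ratio of means is algebraically the same as total claimed over total issued, because `N` cancels, so the second and third formulas as written differ only by the trailing `/N`.

```python
def mean_of_daily_ratios(claimed, issued):
    """Current approach: average the per-day ratios, each day weighted equally."""
    ratios = [c / i for c, i in zip(claimed, issued) if i > 0]
    return sum(ratios) / len(ratios)


def pooled_ratio(claimed, issued):
    """Total claimed over total issued for the window (no trailing /N)."""
    return sum(claimed) / sum(issued)


def ratio_of_means(claimed, issued):
    """Mean(claimed) / Mean(issued); N cancels, so this equals pooled_ratio."""
    n = len(claimed)
    return (sum(claimed) / n) / (sum(issued) / n)


# Made-up window: two busy days and two very quiet days.
claimed = [1, 90, 0, 85]
issued = [1, 100, 2, 100]

print(mean_of_daily_ratios(claimed, issued))  # quiet days count as much as busy ones
print(pooled_ratio(claimed, issued))          # quiet days barely move the result
print(ratio_of_means(claimed, issued))
```

In this example the per-day average is 0.6875 while the pooled ratio is 176/203 (about 0.867): the day with 0 of 2 codes claimed drags the per-day mean down but has almost no effect on the pooled figure.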

What would be super helpful is to have real-world data against which to run these models and see whether they would alert (and whether we'd want them to alert).

Mike Helmick · Answer 4 · Tue Nov 16 2021 09:11:26 GMT+0800 (China Standard Time)

@sethvargo and I had already discussed making this per-realm configurable. I think we should go that way - including allowing a realm to opt-out of these notifications (w/ an appropriate big red warning).

I think the problem w/ the current formula is that it treats all days as equal, and with the way days are cut on UTC, I think we will get weird behaviour on the weekends.

Agree w/ the previous comment that we should run some modeling on real world data and see if we can come up w/ a more appropriate formula.

Brendan Schlaman · Answer 5 · Tue Nov 16 2021 22:12:13 GMT+0800 (China Standard Time)

> 1. Allow per-realm configurable thresholds. This adds additional overhead to realm admins to configure their desired threshold (much like the abuse factor), but it also gives realms that want to be alerted on the slightest anomalies the ability to configure that as such.

I think the main problem with this is that the realms which are most likely to trigger a false positive are also the ones who are least likely to take on this overhead :)

Brendan Schlaman · Answer 6 · Tue Nov 16 2021 22:18:23 GMT+0800 (China Standard Time)

I can ask around if there are any jurisdictions who might be willing to share data to help with the modeling if we aren't going the per-realm route

Seth Vargo · Answer 7 · Tue Nov 16 2021 22:33:05 GMT+0800 (China Standard Time)

@bschlaman even if you could share anonymous data from a few realms, that would be helpful. If you uniformly mutate the data (e.g. multiply all values by 12.74), we could use that to build a better model and see how the model would/wouldn't alert on that data.
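A quick sketch of why uniform scaling keeps the data useful for ratio-based modeling: multiplying both counts by the same factor leaves every daily claim ratio unchanged, up to floating-point rounding. The helper name `scale_counts` and the sample counts are made up for illustration.

```python
import math


def scale_counts(counts, factor):
    """Multiply every daily count by the same factor to mask the real values."""
    return [c * factor for c in counts]


# Made-up daily counts for one realm.
claimed = [3, 45, 1]
issued = [4, 50, 2]

factor = 12.74
scaled_claimed = scale_counts(claimed, factor)
scaled_issued = scale_counts(issued, factor)

# The per-day ratios are preserved by the scaling.
for c, i, sc, si in zip(claimed, issued, scaled_claimed, scaled_issued):
    print(c / i, sc / si, math.isclose(c / i, sc / si))
```

Scaling preserves ratios but not absolute volumes, so a model that keys off issue volume (for example, a high/low issue cutoff) would see the scaled volumes rather than the real ones.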

Brendan Schlaman · Answer 8 · Tue Nov 16 2021 22:46:45 GMT+0800 (China Standard Time)

Ah great point - I guess this counts as homomorphic encryption? :) Sounds good to me, I'll see what I can come up with

Seth Vargo · Answer 9 · Fri Nov 19 2021 23:38:22 GMT+0800 (China Standard Time)

Chatted over email - the new value of 2 stdev seems like a better choice, but we'll continue to monitor and adjust as needed.