KPI calculated even if too little data supplied
arurke opened this issue
This might be an issue in TriScale, or me misunderstanding a use-case.
TL;DR: `analysis_kpi()` returns a valid value when too few data points are supplied, if the "unintuitive" bound is selected (upper for percentile < 50, and vice versa).
Background: The intuitive way to calculate a KPI is to specify the bound that gives us the "worst case" (upper when percentile > 50, and vice versa). This allows us to make "performance is at least X" statements. However, I was thinking there is information in the other bound as well: it would show the width of the CI, and we could learn whether the given metric varies a lot between runs. The first example that comes to mind is industrial scenarios, where not only the maximum latency is interesting, but also its variability.
With this background I was routinely calling `analysis_kpi()` twice, once with bound set to upper and once with lower. Doing this I noticed I would get a valid value when the "unintuitive" bound was selected (upper for percentile < 50, and vice versa), even if I had too little data.
Example with too little data:

```python
import numpy as np
import triscale

data = np.random.randint(0, 10, size=5)
settings = {"bound": "lower", "percentile": 99,
            "confidence": 95, "bounds": [min(data), max(data)]}
independent, kpi = triscale.analysis_kpi(data, settings, verbose=False)
print("KPI: " + str(kpi))
```
With bound set to "upper", the KPI correctly returns NaN. With bound set to "lower", a number is returned.
Sorry for the delay! I have been quite busy recently.
TL;DR: Everything works as expected AFAIU
The number of data points you need depends on the bound (upper/lower) that you pick, for a given percentile (except for the median, of course). To understand why, let's keep your example: P=99, C=95.
- When we compute the "upper" CI for that percentile, what we are actually doing is checking whether we have one data point that has at least 95% probability to be larger than the 99th percentile. This requires more than 5 samples, so the method returns NaN.
- If one computes the "lower" CI for the same percentile, we check whether we have one data point that has at least 95% probability to be smaller than the 99th percentile. And that's easy because most samples (99% of them) are expected to be smaller than the 99th percentile. So one needs very few samples.
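This asymmetry can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the extreme order statistics (sample maximum and minimum), not TriScale's exact Thompson method: an upper bound needs the sample maximum to exceed the 99th percentile with 95% probability, while a lower bound only needs the minimum to fall below it.

```python
import math

P, C = 0.99, 0.95  # percentile and confidence from the example above

# Upper bound from the sample maximum:
# P(max of n samples >= Q_P) = 1 - P**n >= C  =>  n >= log(1-C)/log(P)
n_upper = math.ceil(math.log(1 - C) / math.log(P))

# Lower bound from the sample minimum:
# P(min of n samples <= Q_P) = 1 - (1-P)**n >= C  =>  n >= log(1-C)/log(1-P)
n_lower = math.ceil(math.log(1 - C) / math.log(1 - P))

print(n_upper, n_lower)  # 299 vs 1
```

So for P=99, C=95 the upper bound needs roughly 300 samples while the lower bound is satisfied by a single one, which matches the NaN-vs-number behavior observed with 5 data points.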
Side-note
If you are interested in the variability of a given KPI, you might want to look at the `two-sided` option. In short, it spares you calling the method twice (plus, you are sure with 95% confidence that the percentile lies between the two bounds returned).
Thanks a lot for a very detailed and enlightening explanation. It makes a lot of sense. I tunnel-visioned, assuming they had the same requirements. Regarding the side-note: you mean in `analysis_kpi()`? It forces one-sided as per master now. But I do see there seems to be support for it in `ThompsonCI()` - is this ready to be utilized?
> Sorry for the delay! I have been quite busy recently.
No need to apologize, I am grateful for you taking the time!
Ah yes, you're right. You'll need to go back to the `ThompsonCI()` function to get access to the two-sided option (or you just overwrite the TriScale function to allow that option).
The two-sided option is reliable. JSYK, I opened a PR ages ago to include this `ThompsonCI()` function in scipy but never got around to finishing it... which is a shame, but you know... life. :-/
I looked at and played a bit with the two-sided option. I modified `analysis_kpi()` to basically call `ThompsonCI()` directly and return the lower and upper bound it calculates. I then call it with 1000 data points, varying the class and percentile, for example:
```python
data = np.random.randint(1, 10, size=1000)
settings = {"bound": "lower", "percentile": 90,
            "confidence": 95, "bounds": [min(data), max(data)],
            "class": "two-sided"}
```
The lower and upper bounds I get are as follows:
- "one-sided":
- 90p: 883 - 915. # With 95c, the true 90p is between index 883 and 915.
- 10p: 84 - 116. # With 95c, the true 10p is between index 84 and 116.
- "two-sided":
- 90p: 84 - 915. # With 95c, 90 % of the data is between index 84 and 915
- 10p: 84 - 915. # With 95c, 10 % of the data is between index 84 and 915?!
I am struggling to combine my understanding of CIs and "bounds", the terms in TriScale, and the data I am seeing. I was conflicted, so I added some interpretation statements behind the bounds - I was hoping I could ask you to comment, clarify, or confirm?