KPI calculated even if too little data supplied
arurke opened this issue
This might be an issue in TriScale, or me misunderstanding a use-case.
TL;DR: `analysis_kpi()` returns a valid value when too few data points are supplied, if the "unintuitive" bound is selected (upper for percentile < 50, and vice versa).
Background: The intuitive way to calculate a KPI is to specify the bound that gives us the "worst case" (upper when percentile > 50, and vice versa). This allows us to make "performance is at least X" statements. However, I was thinking there is information in the other bound as well: it would show the width of the CI, and we could learn whether the given metric varies a lot between runs. The first example that comes to mind is industrial scenarios, where not only the maximum latency is interesting, but also its variability.
With this background I was routinely calling `analysis_kpi()` twice, once with bound set to upper and once with lower. Doing this I noticed I would get a valid value when the "unintuitive" bound was selected (upper for percentile < 50, and vice versa), even if I had too little data.
Example with too little data:

```python
import numpy as np
import triscale

data = np.random.randint(0, 10, size=5)
settings = {"bound": "lower", "percentile": 99,
            "confidence": 95, "bounds": [min(data), max(data)]}
independent, kpi = triscale.analysis_kpi(data, settings, verbose=False)
print("KPI: " + str(kpi))
```
With bound set to "upper", the KPI correctly returns NaN. With bound set to "lower", a number is returned.
Sorry for the delay! I have been quite busy recently.
TL;DR: Everything works as expected AFAIU
The number of data points you need depends on the bound (upper/lower) that you pick, for a given percentile (except for the median, of course). To understand why, let's keep your example: P=99, C=95.
- When we compute the "upper" CI for that percentile, what we are actually doing is checking whether we have one data point that has at least 95% probability to be larger than the 99th percentile. This requires more than 5 samples, so the method returns NaN.
- If one computes the "lower" CI for the same percentile, we check whether we have one data point that has at least 95% probability to be smaller than the 99th percentile. And that's easy because most samples (99% of them) are expected to be smaller than the 99th percentile. So one needs very few samples.
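This asymmetry can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the extreme order statistics (sample maximum and minimum), not TriScale's exact Thompson method: an upper bound needs the sample maximum to exceed the 99th percentile with 95% probability, while a lower bound only needs the minimum to fall below it.

```python
import math

P, C = 0.99, 0.95  # percentile and confidence from the example above

# Upper bound from the sample maximum:
# P(max of n samples >= Q_P) = 1 - P**n >= C  =>  n >= log(1-C)/log(P)
n_upper = math.ceil(math.log(1 - C) / math.log(P))

# Lower bound from the sample minimum:
# P(min of n samples <= Q_P) = 1 - (1-P)**n >= C  =>  n >= log(1-C)/log(1-P)
n_lower = math.ceil(math.log(1 - C) / math.log(1 - P))

print(n_upper, n_lower)  # 299 vs 1
```

So for P=99, C=95 the upper bound needs roughly 300 samples while the lower bound is satisfied by a single one, which matches the NaN-vs-number behavior observed with 5 data points.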
Side-note
If you are interested in the variability of a given KPI, you might want to look at the `two-sided` option. In short, it spares you calling the method twice (plus, you are sure with 95% confidence that the percentile lies between the two bounds returned).
Thanks a lot for a very detailed and enlightening explanation. It makes a lot of sense. I tunnel-visioned, assuming they had the same requirements. Regarding the side-note: you mean in `analysis_kpi()`? It forces one-sided as per master now. But I do see there seems to be support for it in `ThompsonCI()` - is this ready to be utilized?
> Sorry for the delay! I have been quite busy recently.
No need to apologize, I am grateful for you taking the time!
Ah yes, you're right. You'll need to go back to the `ThompsonCI()` function to get access to the two-sided option (or you just overwrite the TriScale function to allow that option).
The two-sided option is reliable. JSYK, I opened a PR ages ago to include this `ThompsonCI()` function in scipy but never got around to finishing it... which is a shame, but you know... life. :-/
I looked at and played a bit with the two-sided option. I modified `analysis_kpi()` to basically call `ThompsonCI()` directly and return the lower and upper bound it calculates. I then call it with 1000 data points, varying the class and percentile, for example:
```python
data = np.random.randint(1, 10, size=1000)
settings = {"bound": "lower", "percentile": 90,
            "confidence": 95, "bounds": [min(data), max(data)],
            "class": "two-sided"}
```
The lower and upper bounds I get are as follows:
- "one-sided":
- 90p: 883 - 915. # With 95c, the true 90p is between index 883 and 915.
- 10p: 84 - 116. # With 95c, the true 10p is between index 84 and 116.
- "two-sided":
- 90p: 84 - 915. # With 95c, 90 % of the data is between index 84 and 915
- 10p: 84 - 915. # With 95c, 10 % of the data is between index 84 and 915?!
I am struggling to combine my understanding of CIs and "bounds", the terms in TriScale, and the data I am seeing. I was conflicted, so I added some interpretation statements behind the bounds - I was hoping I could ask you to comment, clarify, or confirm?