Reduce false positive rate of timing tests and add tools for handling them

Question

tomato42 opened this issue 4 years ago

While we now have tests to verify Lucky13 and Bleichenbacher, they have a quite significant false positive rate (>20%). We should improve the statistical classifiers used, the handling of outliers, the way the data is collected, etc., so that the false positive rate is more manageable (<5%).

Hubert Kario · Answer 1 · Tue Jun 16 2020 23:48:14 GMT+0800 (China Standard Time)

see also #106

Hubert Kario · Answer 2 · Mon Jun 29 2020 02:42:54 GMT+0800 (China Standard Time)

Actually, we should be careful with sample sizes, as sample sizes that are too small will not reveal effect sizes that are measurable in practice. See https://stats.stackexchange.com/a/2522/289885 :

In a situation where a "simple" null is tested against a "compound" alternative, as in classic t-tests or z-tests, it typically takes a sample size proportional to 1/ϵ² to detect an effect size of ϵ. There's a practical upper bound to this in any study, implying there's a practical lower bound on a detectable effect size. So, as a theoretical matter der Laan and Rose are correct, but we should take care in applying their conclusion.

i.e. to detect a 1% effect size we need a sample size on the order of 10k, and a sample size on the order of 1M to detect an effect size of 0.1%
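
The arithmetic behind those numbers can be sketched as follows. This is only the 1/ϵ² rule of thumb from the quoted answer; `samples_needed` and its `constant` parameter are hypothetical names, and the constant (which in practice depends on the test and on the variance of the measurements) is assumed to be 1 here.

```python
def samples_needed(effect_size, constant=1.0):
    """Rough sample size for detecting a relative effect of `effect_size`.

    Implements n ~ constant / effect_size**2. `constant` stands in for the
    test- and variance-dependent factor and is assumed to be 1.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    # round() rather than ceil() so float noise does not add a spurious sample
    return round(constant / effect_size ** 2)


for eps in (0.01, 0.001):
    print(f"effect size {eps:.1%}: about {samples_needed(eps):,} samples")
```

Halving the effect size we want to detect quadruples the number of samples, which is why very small timing differences get expensive to test for quickly.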

and we need to remember that the false positive rate is independent of sample size: with an alpha of 0.05, 5% of tests on identical classes will fail no matter how many samples we collect
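
A small simulation can illustrate this. It is a sketch under simplifying assumptions, not what the test scripts do: both classes are drawn from the same normal distribution with known unit variance, a plain two-sample z-test is used, and `false_positive_rate` is a hypothetical helper.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(n, trials=1000, alpha=0.05, seed=0):
    """Fraction of z-tests that reject although both classes are identical."""
    rng = random.Random(seed)
    # two-sided critical value of the standard normal distribution
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # standard error of the difference of two means with sigma = 1
        z = (fmean(a) - fmean(b)) / (2.0 / n) ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


for n in (10, 200):
    print(f"n={n}: false positive rate {false_positive_rate(n):.3f}")
```

Both sample sizes should give a rate close to alpha; collecting more samples narrows the detectable effect size but does not lower the rate of false alarms.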

for very large sample sizes and quick response times we may need to look into checking the practical importance of the result, not just its statistical significance (a result that tells us that one class differs from another by less than one CPU cycle is not a meaningful result), see https://stats.stackexchange.com/a/7849/289885
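
One possible way to express such a check is to require that the whole confidence interval of the difference lies beyond a minimum effect we care about. The sketch below assumes a normal approximation for the difference of means, a 3 GHz clock (so one cycle is about 0.333 ns), and synthetic timings; `practically_different` and `CYCLE_NS` are hypothetical names.

```python
import random
from statistics import NormalDist, fmean, stdev


def practically_different(a, b, min_effect, alpha=0.05):
    """True only if the whole CI of mean(a) - mean(b) lies beyond min_effect."""
    diff = fmean(a) - fmean(b)
    # standard error of the difference of means (normal approximation)
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = diff - z_crit * se, diff + z_crit * se
    return low > min_effect or high < -min_effect


CYCLE_NS = 1 / 3.0  # assumption: 3 GHz CPU, one cycle is about 0.333 ns

# synthetic timings in ns: a 0.05 ns difference, well below one cycle
rng = random.Random(1)
fast = [rng.gauss(100.00, 1.0) for _ in range(20_000)]
slow = [rng.gauss(100.05, 1.0) for _ in range(20_000)]
print(practically_different(fast, slow, CYCLE_NS))
```

With this many samples the 0.05 ns difference is statistically significant, yet the check reports False because the interval stays well inside one cycle.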