Reduce false positive rate of timing tests and add tools for handling them
tomato42 opened this issue · comments
While we have tests to verify Lucky13 and Bleichenbacher now:
- https://github.com/tomato42/tlsfuzzer/blob/master/scripts/test-lucky13.py
- https://github.com/tomato42/tlsfuzzer/blob/master/scripts/test-bleichenbacher-timing.py
they have a fairly high false positive rate (>20%). We should improve the statistical classifiers used, the handling of outliers, the way the data is collected, etc., so that the false positive rate becomes more manageable (<5%)
see also #106
Actually, we should be careful with sample sizes, as sample sizes that are too small will not reveal effect sizes that are measurable in practice. See https://stats.stackexchange.com/a/2522/289885 :
In a situation where a "simple" null is tested against a "compound" alternative, as in classic t-tests or z-tests, it typically takes a sample size proportional to 1/ϵ² to detect an effect size of ϵ. There's a practical upper bound to this in any study, implying there's a practical lower bound on a detectable effect size. So, as a theoretical matter der Laan and Rose are correct, but we should take care in applying their conclusion.
i.e. to detect a 1% effect size we need a sample size on the order of 10k, and a sample size on the order of 1M to detect an effect size of 0.1% (a 10 times smaller effect needs 100 times more samples)
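A rough sketch of that scaling, using the textbook normal-approximation formula for a two-sided, two-sample test. Everything here is an assumption for illustration and not tlsfuzzer code: the function name `samples_per_group`, the 80% power target, and reading "effect size" as the difference in means divided by the standard deviation. With these defaults the constant of proportionality comes out to roughly 16 rather than 1, so the 10k and 1M figures are best read as orders of magnitude; the 1/ϵ² scaling is the part that matters.

```python
import math
from statistics import NormalDist


def samples_per_group(effect, alpha=0.05, power=0.8):
    """Approximate samples needed per class to detect a standardized effect.

    `effect` is assumed to be (difference in means) / (standard deviation).
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / effect)**2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect) ** 2)


if __name__ == "__main__":
    for effect in (0.01, 0.001):
        print(effect, samples_per_group(effect))
```

Shrinking the effect by a factor of 10 multiplies the result by about 100, whatever alpha and power are chosen.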
and we need to remember that the false positive rate is independent of sample size: with an alpha of 0.05, a test on data with no real difference will still report one 5% of the time, no matter how many samples we collect
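A small simulation can illustrate this. It is only a sketch under simplifying assumptions: both classes are drawn from the same normal distribution with a known standard deviation of 1, and a plain z-test stands in for whatever test the scripts actually use. The function name `false_positive_rate` and its parameters are made up for this example.

```python
import math
import random
from statistics import NormalDist, fmean


def false_positive_rate(n, trials=2000, alpha=0.05, seed=0):
    """Fraction of trials in which a z-test rejects a true null hypothesis.

    Both samples come from N(0, 1), so every rejection is a false positive.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Known sigma = 1, so the standard error of the difference is sqrt(2/n).
        z = (fmean(a) - fmean(b)) / math.sqrt(2.0 / n)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    for n in (20, 200):
        print(n, false_positive_rate(n))
```

The printed rates should stay close to alpha for both sample sizes; collecting more data does not make false positives rarer, it only makes smaller real differences detectable.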
for very large sample sizes and quick response times we may need to check the practical importance of the result, not just its statistical significance (a result that tells us that one class differs from another by less than one CPU cycle is not a meaningful result), see https://stats.stackexchange.com/a/7849/289885
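One possible way to encode such a check, as a sketch only: compute a confidence interval for the difference in mean response time and call the result meaningful only if the whole interval lies beyond a minimum effect, here one CPU cycle. The function name `is_practically_significant`, the normal-approximation interval, and the 3 GHz clock used to convert a cycle to nanoseconds are all assumptions for the example.

```python
import math
from statistics import NormalDist, fmean, variance


def is_practically_significant(a, b, min_effect, alpha=0.05):
    """True if the CI of mean(a) - mean(b) lies entirely beyond min_effect."""
    diff = fmean(a) - fmean(b)
    std_err = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = diff - z * std_err, diff + z * std_err
    return low > min_effect or high < -min_effect


if __name__ == "__main__":
    cycle_ns = 1 / 3.0  # one cycle at an assumed 3 GHz clock, in nanoseconds
    base = [100.0, 100.2, 99.8] * 1000       # response times in ns
    tiny = [t + 0.01 for t in base]          # differs by 0.01 ns
    large = [t + 1.0 for t in base]          # differs by 1 ns
    print(is_practically_significant(base, tiny, cycle_ns))
    print(is_practically_significant(base, large, cycle_ns))
```

In the first comparison the 0.01 ns difference is statistically detectable with 3000 samples per class, yet it is far below one cycle, so the check rejects it; the 1 ns difference passes.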