official-stockfish / fishtest

The Stockfish testing framework

Home Page: https://tests.stockfishchess.org/tests

Concerns about the stats section on the tests_view

peregrineshahin opened this issue · comments

This is more of a question than an issue. We currently have a dedicated stats page, which I think is the most up to date in terms of the statistics. On the other hand, the right panel of tests_view has a stats section that was last maintained over 11 years ago. The p-value shown there is inconsistent with the definition LOS = 1 - p-value, which made me wonder.
I think this part of the page is already redundant now that we have a stats page, but I would still like to know whether the labels and the math here are correct or just confusing.

https://tests.stockfishchess.org/tests/view/6604a9020ec64f0526c583da
https://tests.stockfishchess.org/tests/view/660ad1b60ec64f0526c5dd23

@vdbergh should we remove this section from tests_view?

I guess the chi^2 value and the degrees of freedom are not useful information here (they are a kind of sanity check, and could be moved to the raw stats page). However, I think the p-value is somewhat useful.

Note that the p-value here is unrelated to the p-value of the test.

The p-value here is the probability of seeing a chi^2 statistic (a kind of sum of squares) at least as large as the observed one, under the null hypothesis that all workers are identical. Under the null hypothesis the p-value is uniformly distributed, so a value < 0.05 should occur in only 5% of cases if all workers are identical. If the p-value is very low, the residuals should indicate which worker is the culprit.
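
To make the idea concrete, here is a minimal Python sketch of a chi^2 homogeneity test across workers. This is not the fishtest code: the function names, the (wins, losses, draws) input layout and the use of plain Pearson residuals are all assumptions made for illustration, and the tail probability uses a simple series that is fine for a sketch but not tuned for extreme values.

```python
import math


def chi2_sf(x, dof):
    """Upper tail P(X >= x) of a chi^2 distribution with `dof` degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma function.
    """
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total and n < 100000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def worker_homogeneity(results):
    """Chi^2 test of the null hypothesis that all workers behave identically.

    `results` (hypothetical layout) maps worker name -> (wins, losses, draws).
    Assumes every worker played games and every outcome occurred at least once.
    Returns (chi2, dof, p_value, residuals), where residuals[worker] holds the
    Pearson residuals (observed - expected) / sqrt(expected) per outcome.
    """
    names = list(results)
    col_totals = [sum(results[w][j] for w in names) for j in range(3)]
    grand = sum(col_totals)
    chi2 = 0.0
    residuals = {}
    for w in names:
        row_total = sum(results[w])
        res = []
        for j in range(3):
            # Expected count if this worker matched the pooled W/L/D proportions.
            expected = row_total * col_totals[j] / grand
            res.append((results[w][j] - expected) / math.sqrt(expected))
        residuals[w] = res
        chi2 += sum(r * r for r in res)
    dof = (len(names) - 1) * (3 - 1)
    return chi2, dof, chi2_sf(chi2, dof), residuals
```

With this layout, a worker whose residuals contribute the largest share of the chi^2 sum is the first one to look at when the p-value is very low, e.g. `max(residuals, key=lambda w: sum(r * r for r in residuals[w]))`.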

The chi^2 test was created when people were raising alarming reports that some workers were cooking the results. The chi^2 test showed that everything was normal and that the supposed anomalous behavior could be fully explained by statistical fluctuations.

That being said, the null hypothesis is of course not strictly satisfied, as the workers are all slightly different. For workers with many cores, such differences are amplified in the outcome of the test. So I am not sure whether the p-value distribution is truly uniform. That could be checked with another chi^2 test 👀
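
A rough Python sketch of such a uniformity check, assuming one had collected the worker p-values from many finished tests (the function names, the input list and the choice of 10 equal-width bins are hypothetical, and this is not code from fishtest): bin the p-values, compare the bin counts with the flat counts expected under uniformity, and compute a chi^2 goodness-of-fit p-value.

```python
import math


def chi2_sf(x, dof):
    """Upper tail P(X >= x) of a chi^2 distribution with `dof` degrees of freedom.

    Series for the regularized lower incomplete gamma function; sketch quality.
    """
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total and n < 100000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def p_value_uniformity(p_values, bins=10):
    """Chi^2 goodness-of-fit test of p-values against the uniform distribution.

    Returns (chi2, dof, p_value). A small returned p-value suggests that the
    collected p-values are not uniformly distributed on [0, 1].
    """
    counts = [0] * bins
    for p in p_values:
        # Clamp so that p == 1.0 falls into the last bin.
        counts[min(int(p * bins), bins - 1)] += 1
    expected = len(p_values) / bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    dof = bins - 1
    return chi2, dof, chi2_sf(chi2, dof)
```

The bin count is a judgment call: too many bins leave few p-values per bin, and the chi^2 approximation becomes unreliable when the expected count per bin is small.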

Amazing! I really appreciate the reply!