Test for statistical significance in BasicTrainer.has_improved()

senarvi opened this issue 9 years ago · comments

Compare the probabilities that the new model and the old model assign to words or sentences in the validation set, and determine whether the new state is significantly better. The probabilities are assumed to be independent; if word-level probabilities are used, that assumption is violated.

Possible tests for statistical significance include the Sign test and the Wilcoxon signed-rank test. For the former, one has to collect statistics of how many probabilities were lower, the same, or greater in the new model; for the latter, the differences between the paired probabilities.
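The statistics needed by both tests could be gathered in a single pass over the validation set. The following is only a minimal sketch: `collect_statistics` and its arguments are hypothetical names, not part of the existing code, and it assumes the two models' log probabilities are available as equally long sequences aligned word by word (or sentence by sentence).

```python
def collect_statistics(old_logprobs, new_logprobs):
    """Compare paired log probabilities from the old and the new model.

    Hypothetical helper: returns the counts needed by the Sign test
    (lower, same, greater) and the paired differences needed by the
    Wilcoxon signed-rank test.
    """
    if len(old_logprobs) != len(new_logprobs):
        raise ValueError("both models must score the same items")

    num_lower = num_same = num_greater = 0
    differences = []
    for old, new in zip(old_logprobs, new_logprobs):
        diff = new - old  # positive when the new model is better
        differences.append(diff)
        if diff > 0:
            num_greater += 1
        elif diff < 0:
            num_lower += 1
        else:
            num_same += 1
    return num_lower, num_same, num_greater, differences
```

Ties are counted separately here because the Sign test is usually applied only to the items that differ, so `num_lower + num_greater` would be the number of trials.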

An example implementation of the Sign test can be found in the cumbin script included in SRILM. For example, if the new model gives a higher probability for 1080 out of 2000 words (and there are no ties), the significance levels are computed as follows:

% cumbin 2000 1080
One-tailed: P(k >= 1080 | n=2000, p=0.5) = 0.00018750253721029
Two-tailed: 2*P(k >= 1080 | n=2000, p=0.5) = 0.00037500507442058
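The same binomial tail could be computed directly in Python. This is a rough sketch, not the cumbin implementation: `sign_test` is a hypothetical name, ties are assumed to have been removed beforehand, and `math.comb` requires Python 3.8 or later. The tail is summed with exact integers, so the only rounding happens in the final division.

```python
from math import comb


def sign_test(num_trials, num_greater):
    """Return one- and two-tailed p-values of the Sign test.

    Hypothetical helper: computes P(k >= num_greater | n=num_trials, p=0.5)
    and twice that value, capped at 1.
    """
    if not 0 <= num_greater <= num_trials:
        raise ValueError("num_greater must be between 0 and num_trials")

    # Number of outcomes with at least num_greater successes, as an exact integer.
    tail = sum(comb(num_trials, k)
               for k in range(num_greater, num_trials + 1))
    # Under p = 0.5 every one of the 2**n outcomes is equally likely.
    one_tailed = tail / 2 ** num_trials
    two_tailed = min(1.0, 2.0 * one_tailed)
    return one_tailed, two_tailed


one_tailed, two_tailed = sign_test(2000, 1080)
print("One-tailed:", one_tailed)
print("Two-tailed:", two_tailed)
```

For the example above this should agree with the cumbin output up to floating-point rounding. Doubling the upper tail is a meaningful two-tailed value only when `num_greater` is at least half of `num_trials`.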