topepo / FES

Code and Resources for "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Kuhn and Johnson

Home Page: https://bookdown.org/max/FES

Section 3.4.1 variance reduction not Sqrt(R)

TimothyMasters opened this issue

Section 3.4.1 discusses R repeats of V-fold cross validation and incorrectly states that the variance reduction will be by a factor of Sqrt(R). This would be true only if the measures across repeats were independent, which they are not. In fact, the variance reduction depends on the value of V. As an extreme example, for leave-one-out cross validation there can be no variance reduction at all. Even at the other extreme, with V=2, variance reduction will not reach Sqrt(R). To help understand the issues involved, consider two facts: there are a finite (though very large) number of possible partitions for cross validation, and the training set is itself a random sample from the population.
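One way to make the dependence explicit (a sketch, not taken from the book): if the R repeat estimates $\hat\theta_1,\dots,\hat\theta_R$ have common variance $\sigma^2$ and pairwise correlation $\rho$ induced by the shared training set, then

$$
\operatorname{Var}\!\left(\frac{1}{R}\sum_{r=1}^{R}\hat\theta_r\right)
= \frac{\sigma^2}{R}\bigl[1 + (R-1)\rho\bigr]
\;\longrightarrow\; \rho\,\sigma^2 \quad \text{as } R \to \infty,
$$

so the standard error shrinks by the full factor of Sqrt(R) only when $\rho = 0$. In the leave-one-out case there is only one possible partition, so $\rho = 1$ and repeating gives no reduction at all.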

We do note that it is an approximation.

I don't think that it is misleading since there are many instances where the resampled statistics can be treated as independent even though the resamples contain some of the same data. The bootstrap is a good example, and there are multiple proofs that show that those resamples converge to the empirical distribution of the resampled statistic.

So, you are absolutely correct that the variance reduction is not equal to Sqrt(R). I'm arguing that the text accurately indicates that the benefit of adding replicates decreases with R.
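A minimal simulation sketch of that diminishing return, assuming the repeat estimates share a fixed-training-set component with between-repeat correlation `rho` (the value 0.5 is purely illustrative, not from the book):

```r
# Toy model: each repeat's error estimate = a component fixed by the training
# set (shared across repeats) + an independent fold-assignment component.
set.seed(1)
rho   <- 0.5     # assumed between-repeat correlation (illustrative)
n_sim <- 10000   # Monte Carlo replications of the whole repeated-CV procedure

for (R in c(1, 2, 5, 10, 20)) {
  means <- replicate(n_sim, {
    shared  <- rnorm(1, sd = sqrt(rho))          # same for every repeat
    repeats <- shared + rnorm(R, sd = sqrt(1 - rho))
    mean(repeats)
  })
  cat(sprintf("R = %2d  SD of averaged estimate = %.3f  (1/sqrt(R) = %.3f)\n",
              R, sd(means), 1 / sqrt(R)))
}
```

In this toy setup the standard deviation of the averaged estimate levels off near sqrt(rho) rather than falling like 1/sqrt(R), consistent with the point above: extra repeats help, but with shrinking returns.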