rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource


Chapter: Combining weak to strong learners via random forests [sample size]

VedAustin opened this issue · comments

Via the sample size n of the bootstrap sample, we control the bias-variance tradeoff of the random forest. By choosing a larger value for n, we decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can reduce the degree of overfitting by choosing smaller values for n at the expense of the model performance.

To me this implies that I should choose a sample size n that is smaller than N (the original training set size).

In most implementations, including the RandomForestClassifier implementation in scikit-learn, the sample size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff

But the above got me confused: If we choose n = N, then aren't we overfitting unless the algorithm is bootstrapping aggressively - repeating the values many times over?

Via the sample size n of the bootstrap sample, we control the bias-variance tradeoff of the random forest. By choosing a larger value for n, we decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can reduce the degree of overfitting by choosing smaller values for n at the expense of the model performance.

To me this implies that I should choose a sample size n that is smaller than N (the original training set size).

Thanks for asking and highlighting this point. I agree that this may sound confusing at first, and I'll make a note to adjust the language a bit for a potential second edition. However, the statement should be correct overall. By model performance, I meant the "generalization performance" estimated via an independent test set. It's basically all about the bias-variance trade-off. E.g., if you have a train/test performance of 99%/90% accuracy, this would indicate slight overfitting, but compared to e.g. 70%/70%, the former is still to be preferred over the latter, I'd say ;).
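Just to illustrate what I mean by comparing train and test accuracy (this is only a rough sketch, with the Iris data as a stand-in, not an example from the book):

```python
# Rough sketch: compare training vs. test accuracy of a random forest
# to gauge over-/underfitting (Iris is just a placeholder dataset here).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# A large gap between the two numbers hints at overfitting (high variance);
# two similarly low numbers hint at underfitting (high bias).
print('Train accuracy: %.3f' % forest.score(X_train, y_train))
print('Test accuracy:  %.3f' % forest.score(X_test, y_test))
```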

In most implementations, including the RandomForestClassifier implementation in scikit-learn, the sample size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff

But the above got me confused: If we choose n = N, then aren't we overfitting unless the algorithm is bootstrapping aggressively - repeating the values many times over?

We really can't generalize this rule of thumb to all datasets; it's more or less application dependent. As far as I know, n = N is empirically the best trade-off, but altering it for certain applications may improve (or worsen) the model performance. In some cases it may overfit more than in others, but it typically works well, and it's the "original" random forest formulation after all. However, this is also something you can look at when you assess your model during cross-validation.
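In case it helps, here is a minimal (not optimized) sketch of what I mean by the "original" formulation, where every tree is grown on a bootstrap sample of the same size N as the training set; the helper names are my own, not from the book or scikit-learn:

```python
# Minimal bagging-style sketch: each tree gets a bootstrap sample of size n = N.
# Assumes integer-encoded class labels; function names are made up for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=100, seed=1):
    rng = np.random.RandomState(seed)
    N = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, N, size=N)  # draw N indices *with* replacement
        tree = DecisionTreeClassifier(max_features='sqrt',  # random feature subsets per split
                                      random_state=rng.randint(2**31 - 1))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    # majority vote over the individual trees
    preds = np.array([tree.predict(X) for tree in trees], dtype=int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                               axis=0, arr=preds)
```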

Here, you have a probability of 1 - (1 - 1/N)^N that a given training sample occurs in the bootstrap sample, which is also commonly used for bootstrap model assessment (asymptotically, this would be 1 - e^(-1) ≈ 0.632).
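You can verify that number quickly (just a sanity-check simulation, not code from the book):

```python
# Sanity check of 1 - (1 - 1/N)^N: fraction of training samples that show up
# at least once in a bootstrap sample of the same size N.
import numpy as np

rng = np.random.RandomState(123)
N = 1000
fractions = []
for _ in range(1000):
    idx = rng.randint(0, N, size=N)            # bootstrap sample of size N
    fractions.append(np.unique(idx).size / N)  # fraction of unique samples

print('Simulated:  %.3f' % np.mean(fractions))
print('Formula:    %.3f' % (1 - (1 - 1 / N) ** N))
print('Asymptote:  %.3f' % (1 - np.exp(-1)))   # ~0.632
```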

Hm, but I don't happen to have a good paper at hand for an empirical comparison; I believe that the bootstrap-size parameter is probably the one that is typically NOT tuned (compared to the number of splitting features and the tree depth); I would say that it's part of the "random forest definition" originating from Breiman's bagging algorithm from 1996.

I would close this question for now, but please feel free to comment further on it. And I will post empirical hyperparameter comparisons if I stumble upon any!

Before I read this I had always assumed that n<N - maybe I derived this incorrectly from d < D (total number of features)

Before I read this I had always assumed that n<N - maybe I derived this incorrectly from d < D (total number of features)

Yeah, I think that's probably where the confusion comes from -- the d < D (total number of features) you mentioned. "Typically," a bootstrap sample (e.g., in model evaluation or in bagging & RF) has the same number of samples as the training set. Usually, it's the number of features sampled at each split of the decision trees that is tuned via hyperparameter optimization. However, some people also change the number of samples in the bootstrap sample, since it may have a positive effect on the performance in certain applications -- note that not many software packages let you change the bootstrap size (e.g., I think scikit-learn doesn't).
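If it's useful, here is roughly how that typical tuning workflow would look in scikit-learn (just a sketch; Iris is only a placeholder dataset, and the grid values are arbitrary) -- note that the bootstrap size itself is not among the tuned parameters:

```python
# Sketch: tune the number of features per split and the tree depth via
# cross-validation; the bootstrap sample size stays fixed at N.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    'max_features': ['sqrt', 'log2', None],  # features considered at each split
    'max_depth': [None, 3, 5, 10],           # maximum tree depth
}

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=1),
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=5)
gs.fit(X, y)

print('Best params:', gs.best_params_)
print('CV accuracy: %.3f' % gs.best_score_)
```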