Does NGBoost need iid in training set?
CadePGCM opened this issue
I know that for typical gradient boosting algorithms like xgboost or lgbm an iid training set is often assumed.
Is it also true for ngboost? I'm seeing significant improvement on non-iid data using NGBoost over the above in classification (which NGBoost was not really designed to improve), so I'm curious about the theory.
No method for supervised learning (probabilistic or otherwise) strictly requires IID data. In general the prediction target is a functional minimizer of a given loss, which is well-defined even if there is inter-observation dependence. For example, when doing point prediction with MSE loss, the thing you end up estimating is the conditional mean of the outcome given the features, E[Y | X = x].
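As a quick numerical illustration of that point (all data here is synthetic and the setup is my own, not from the thread), the constant that minimizes empirical MSE over a narrow slice of X is just the sample mean of Y in that slice, i.e. an estimate of E[Y | X = x]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = 2x + noise and look at one narrow slice of x.
x = rng.uniform(0, 1, 50_000)
y = 2 * x + rng.normal(0, 0.5, x.size)
mask = (x > 0.49) & (x < 0.51)          # observations with x near 0.5
y_slice = y[mask]

# Among constant predictions c, find the one minimizing mean squared error.
candidates = np.linspace(0, 2, 401)
mse = [np.mean((y_slice - c) ** 2) for c in candidates]
best_c = candidates[int(np.argmin(mse))]

# The empirical MSE minimizer matches the conditional mean E[Y | X ~ 0.5] = 1.0.
print(best_c, y_slice.mean())
```

Nothing in this calculation depends on the observations being independent; dependence only affects how efficiently you estimate the target, not what the target is.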
The thing you are trying to unbiasedly estimate at the end of the day is the prediction error. As long as a) your test set is drawn fairly from the same distribution as the data that you plan to eventually deploy the model to predict on and b) the statistical dependence between your training and test sets is the same as between the training data and the future deployment data, then your test-set error will be an unbiased estimate of the future generalization error. So you might want to do a training/test split by cluster instead of by individual, for example. Or you may need to do the splitting in a way that respects a time ordering.
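A minimal sketch of the cluster-wise split idea, using scikit-learn's `GroupShuffleSplit` (the clusters and data here are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# 200 observations from 20 clusters (e.g. patients or sites) -- synthetic data.
groups = np.repeat(np.arange(20), 10)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)

# Split by cluster: every cluster lands entirely in train OR test, so
# within-cluster dependence never straddles the train/test boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

shared = set(groups[train_idx]) & set(groups[test_idx])
print(len(shared))  # no cluster appears on both sides
```

For time-ordered data the analogous tool would be something like `TimeSeriesSplit`, which keeps the test fold strictly after the training fold.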
NGBoost is no different. Here our target of inference is the full conditional distribution P(Y | X = x) rather than just its mean.
(closing but feel free to continue discussion if you have more questions!)
Hello @alejandroschuler and @CadePGCM, I'd like to follow up a bit on this question. While the case you explain above, @alejandroschuler, makes perfect sense, there are many applications where the observed outcomes can be correlated.
That is, what if the outcomes are correlated within subjects or groups, e.g. repeated measurements on the same individual?
There have been some attempts at defining gradient boosting methods that handle this situation by introducing additional variance parameters, so that within-subject and between-subject variation are separated, e.g., https://www.degruyter.com/document/doi/10.1515/ijb-2020-0136/html. However, these methods do not seem to scale well, and I haven't seen use of regression trees as base learners in this case. There is, however, a very nice paper on random forests that seems to tackle many of these issues: https://www.tandfonline.com/doi/full/10.1080/01621459.2021.1950003.
To me, this seems like a very open research problem, but perhaps NGBoost provides the right framework? I'd love to hear if you have any thoughts on this.
I think my argument still applies: just marginalize over the subject-level random effects. The marginal conditional distribution of the outcome given the features is still well-defined even when outcomes are correlated within subjects, so it remains a valid target of inference.
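A small simulation of what marginalizing means here (my own sketch, with hypothetical variance components): in a random-intercept model y_ij = b_i + eps_ij, the marginal distribution of each y_ij is a perfectly well-defined normal whose variance is the sum of the between- and within-subject components, despite the correlation among observations from the same subject.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clustered data: y_ij = b_i + eps_ij with subject effect b_i and noise eps_ij.
# Hypothetical variance components: sigma_b^2 = 1.0, sigma_e^2 = 0.25.
n_subjects, per_subject = 2_000, 10
b = rng.normal(0, 1.0, n_subjects)
y = b[:, None] + rng.normal(0, 0.5, (n_subjects, per_subject))

# Marginalizing over b_i, each y_ij ~ N(0, sigma_b^2 + sigma_e^2) = N(0, 1.25).
# The within-subject correlation changes the joint distribution, not the
# marginal target of inference.
print(y.var())  # close to 1.25
```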