vecxoz / vecstack

Python package for stacking (machine learning technique)

Support for custom Cross Validation strategies

AlexandruBurlacu opened this issue · comments

The package looks amazing, but from what I saw, one cannot pass a cross-validation object from sklearn, only the number of folds plus flags to enable or disable shuffling and stratification. This is a problem when working with time series data and wanting to use TimeSeriesSplit from sklearn. Would you consider adding another toggle, like time_series={True, False}, or even changing the API a bit so that, instead of passing the number of folds, shuffle, and stratified, there is a single argument like cv that takes a splitter object from sklearn?

Thanks! I’m glad you like the package.

Your suggestions about custom cross-validation and TimeSeriesSplit are very reasonable. But these things are not so straightforward.

Custom cross-validation

It was my explicit decision not to allow custom cross-validation. The main reason is that less freedom means more stability. For example, according to the stacking concept, it’s not allowed to predict data points that were used for training. But some cross-validation strategies do not guarantee this, e.g. folds may be drawn with replacement. Such situations are hard to debug and may lead to a bad user experience. In the near future I don’t plan to add support for custom cross-validation.
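To make the concern concrete, here is a minimal sketch (not part of vecstack) that checks whether a scikit-learn splitter puts every sample into exactly one test fold, which is what out-of-fold stacking relies on. The helper names count_test_appearances and is_partition are hypothetical, and ShuffleSplit is used only as one example of a strategy whose test sets can overlap.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit


def count_test_appearances(cv, n_samples):
    """Count how many times each sample index lands in a test fold."""
    counts = np.zeros(n_samples, dtype=int)
    X = np.zeros((n_samples, 1))  # dummy features, only the length matters
    for _, test_idx in cv.split(X):
        counts[test_idx] += 1
    return counts


def is_partition(cv, n_samples):
    """True if every sample is predicted exactly once across all folds."""
    return bool(np.all(count_test_appearances(cv, n_samples) == 1))


# KFold splits the data into disjoint test folds that cover every sample.
print(is_partition(KFold(n_splits=4), 12))

# ShuffleSplit draws each split independently, so test sets can overlap.
print(is_partition(ShuffleSplit(n_splits=4, test_size=0.5, random_state=0), 12))
```

A check like this could be run before stacking with a custom splitter, but it only verifies the index bookkeeping, not whether the strategy is appropriate for the data.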

TimeSeriesSplit

The nature of time series data implies some side effects which make stacking a bit tricky. For example, we can’t predict the first part of the train data (because we can’t use data from the future for training). As a result, the transformed train data (OOF) will have a different shape (fewer examples). In some applications such a result is not expected or not acceptable. So again, in the near future I don’t plan to add support for TimeSeriesSplit.
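The shape problem can be seen directly from the indices that TimeSeriesSplit produces. The sketch below is an illustration only: oof_coverage is a hypothetical helper, and the sample and split counts are arbitrary. It marks which rows ever appear in a test fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def oof_coverage(n_samples, n_splits):
    """Return a boolean mask of rows that appear in some test fold."""
    X = np.zeros((n_samples, 1))  # dummy features, only the length matters
    covered = np.zeros(n_samples, dtype=bool)
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        # Training rows always precede test rows in time.
        assert train_idx.max() < test_idx.min()
        covered[test_idx] = True
    return covered


covered = oof_coverage(n_samples=12, n_splits=3)
print(covered.sum(), "of", covered.size, "rows receive an OOF prediction")
print("rows never predicted:", np.flatnonzero(~covered))
```

The earliest block of rows is only ever used for training, so the OOF array is either shorter than the train set or needs a placeholder for those rows.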

Solution

Check out this tutorial. You can take the example of stacking from scratch and build a custom modification on top of it according to your needs.

Thank you for the suggested tutorial, and for explaining your reasons behind the more constrained cross-validation scheme. I think the issue should be closed then.

Hi,

I think implementing TimeSeriesSplit is kind of straightforward too.

As you mentioned, only the first fold is affected: it will just be filled with zeros. This is something we have to accept when stacking time series models.
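A rough sketch of that idea, written from scratch with scikit-learn rather than with vecstack: the function name time_series_oof, the fill_value parameter, and the returned mask are all assumptions made for illustration. Rows that never fall into a test fold keep the fill value, and the mask lets a second-level model skip them instead of training on placeholder zeros.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit


def time_series_oof(model, X, y, n_splits=3, fill_value=0.0):
    """Build OOF predictions with TimeSeriesSplit.

    Returns the OOF array (same length as X) and a boolean mask marking
    rows that actually received a prediction.
    """
    oof = np.full(len(X), fill_value, dtype=float)
    predicted = np.zeros(len(X), dtype=bool)
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        fold_model = clone(model)  # fresh, unfitted copy for each fold
        fold_model.fit(X[train_idx], y[train_idx])
        oof[test_idx] = fold_model.predict(X[test_idx])
        predicted[test_idx] = True
    return oof, predicted


# Toy data: a noiseless linear trend over 12 time steps.
X = np.arange(12, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

oof, predicted = time_series_oof(LinearRegression(), X, y)
print(oof)
print(predicted)
```

Whether zeros are a sensible placeholder depends on the target scale; dropping the unpredicted rows via the mask before fitting the second-level model is an alternative.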