antoinecarme / pyaf

PyAF is an Open Source Python library for Automatic Time Series Forecasting built on top of popular pydata modules.

Add the possibility to use cross validation when training PyAF models

antoinecarme opened this issue

Following the investigation performed in #53, implement a form of cross validation for PyAF models.

Specifications:

  1. Cut the dataset into multiple folds according to a scikit-learn time series split:
    http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
    The number of folds is a user option (default = 10).

  2. To have enough data, use only the last n/2 folds for estimating the models (thanks to the forecast R package ;). The default splits look like this (see also the sketch after this list):
    [5 ] [6]
    [5 6 ] [7]
    [5 6 7] [8]
    [5 6 7 8] [9]
    [5 6 7 8 9] [10]

  3. Use the model decomposition type or formula as a hyperparameter and optimize it. Select the decomposition(s) with the lowest mean MAPE over the validation datasets of all the possible splits.

  4. Among all the chosen decompositions, select the model with the lowest complexity (~ number of inputs).

  5. Execute the procedure on the ozone and air passengers datasets and compare with the non-cross-validation models (=> 2 Jupyter notebooks).
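
A minimal sketch of the split generation in step 2, assuming the signal lives in a NumPy array; default_splits and mape are hypothetical helpers written for illustration, not PyAF code:

    import numpy as np

    def default_splits(n, n_folds=10):
        # cut indices 0..n-1 into n_folds contiguous folds, then build the
        # expanding-window splits shown above: train on folds [5 .. k] and
        # validate on fold k+1 (1-based), for the last n_folds/2 folds
        folds = np.array_split(np.arange(n), n_folds)
        first = n_folds // 2 - 1  # 0-based index of fold "5"
        for k in range(first + 1, n_folds):
            yield np.concatenate(folds[first:k]), folds[k]

    def mape(actual, forecast):
        # mean absolute percentage error, used to rank decompositions
        return np.mean(np.abs((actual - forecast) / actual))

Each candidate decomposition in step 3 would then be scored by its mean MAPE over the validation folds yielded by default_splits.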

Classical PyAF modeling is a special case of this cross validation with a single split (nfolds = 5, split = [1 2 3 4] [5]), so the implementation should be made by adapting the existing code. Training on each one of the splits is equivalent to training a model the old way.

Hi, I've been watching your project for a while (mostly because I have been working on a similar project, which comes at this from a different perspective 😛).
I'd just like to note that, from the business case, there are (at least) 2 different kinds of time series CV: with and without retraining on each fold. The first one (that you've described above) is useful for settings where you can constantly re-train your model. The second one is for when you don't have the ability to re-train, but want to know what the model will do on future, shorter folds. This is relevant for models with hidden components (e.g. ARIMA, state-space models, RNNs, ...) where the state can be much different when starting later than when starting from the beginning (as an analogy, a Markov chain that isn't yet in its stationary distribution).

@NowanIlfideme

Thanks a lot for your interest in PyAF. Comments like these are always welcome. Hope you enjoy it.

Models with state/hidden components are not yet supported, but if you look closely, PyAF is always evolving. Cross validation work started a year ago, and its first implementation will be available in the coming weeks.

Can you please elaborate a little bit more on the second case (a Python example in a gist, perhaps)? Any docs/references?

I don't quite have the time to make a full example; I hope a block sketch will work. :)

Full Set:
[1 2 ... N N+1 ... 2N]

Train (same for all):
[1 2 3 ... N]

Validation:
Sees [1 ... N], predicts [N+1]
Sees [2 ... N+1], predicts [N+2]
...
Sees [N-1 ... 2N-1], predicts [2N]

If you only use stateless models, this is the same as validating on the set [N+1 ... 2N]. However, for stateful models, this means you will always be using N steps (the training-window length) to "warm up" your model, and thus get consistent behavior (you'd do this in production, as well).
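
A short sketch of this no-retraining evaluation, assuming a hypothetical stateful model interface with fit() and predict_next() methods (not PyAF API):

    def rolling_validation_no_retrain(model, series, n_train):
        # train once on [1 .. N]
        model.fit(series[:n_train])
        errors = []
        for t in range(n_train, len(series)):
            # replay the last N points to warm up the hidden state,
            # then predict the next point
            forecast = model.predict_next(series[t - n_train:t])
            errors.append(abs(series[t] - forecast))
        return errors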

As an alternative, you could use the following scheme for stateless models as well:

Trains on [1 ... N], predicts [N+1]
Trains on [2 ... N+1], predicts [N+2]
...
Trains on [N-1 ... 2N-1], predicts [2N]

This will always give a "window", and again be consistent. However, the end use of these methods is different. 😃
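
Under the same hypothetical interface, the retraining variant refits a fresh model on each fixed-length window; make_model is an assumed factory that returns an untrained model:

    def rolling_validation_retrain(make_model, series, n_train):
        errors = []
        for t in range(n_train, len(series)):
            model = make_model()
            # refit on a fixed-length rolling window ending just before t
            model.fit(series[t - n_train:t])
            forecast = model.predict_next(series[t - n_train:t])
            errors.append(abs(series[t] - forecast))
        return errors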

The block sketch is clear and very interesting ;). I will keep this aside for implementing support for stateful models.

Do you have any book reference for this kind of stuff? Putting time series models in production, etc.

I'm going mainly by experience, sorry that I can't give any written reference. Cheers!

Cheers!

@NowanIlfideme

What about summarizing your experience in a GitHub repository (markdown)? I am also not aware of a written reference for this kind of stuff. Please think of it when you have some time.

Thanks a lot.

This is how to adapt the training process to activate cross validation in PyAF (with 7 folds):

    import pyaf.ForecastEngine as autof

    # ozone_dataframe: an already-loaded pandas DataFrame with a 'Month'
    # time column and an 'Ozone' signal column
    lEngine = autof.cForecastEngine()
    # activate time series cross validation with 7 folds
    lEngine.mOptions.mCrossValidationOptions.mMethod = "TSCV"
    lEngine.mOptions.mCrossValidationOptions.mNbFolds = 7
    lEngine.train(ozone_dataframe, 'Month', 'Ozone', 12)  # horizon = 12
    lEngine.getModelInfo()

FIXED!!!!