vecxoz / vecstack

Python package for stacking (machine learning technique)


How to predict

kanekcwu opened this issue · comments

After I've created my models and I'm happy with the results, how can I save the models and use them to predict on real-life data?

I'm also a little confused here. Is there a method you can use, after having trained your final model, to transform a hitherto unseen dataset into the format (base level predictions) that the final model is expecting?

What I did as a stopgap is to use X_test as the unseen dataset, but then I have to recreate the models every time, which takes forever.

How can I save the models and use it to predict on real life data?

You can easily save your models and use them to predict. I will show how to do that in the example below.
The architecture by itself does not provide the ability to save models internally. This package is a relatively low-level tool, so operations like saving models are expected to be performed manually by the end user (if needed). The intention is to give more freedom and to save resources.

Is there a method you can use, after having trained your final model, to transform a hitherto unseen dataset into the format (base level predictions) that the final model is expecting?

Of course there is. Please see the example below.

Example

Let’s say we want to perform stacking for a regression task and we expect new (unseen) test sets in the future. Of course we don’t want to refit all our models every time; we just want to predict. The approach is as follows:

[1] Define 1st level models:

from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

models_L1 = [
    ExtraTreesRegressor(random_state=0),
    RandomForestRegressor(random_state=0)
]

[2] Fit the 1st level models (and possibly save them to files):

model_L1_0 = models_L1[0]
_ = model_L1_0.fit(X_train, y_train)
# save the model to a file if needed

model_L1_1 = models_L1[1]
_ = model_L1_1.fit(X_train, y_train)
# save the model to a file if needed
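For the optional saving step, joblib works well for scikit-learn models; a minimal sketch (the file names are hypothetical):

from joblib import dump

dump(model_L1_0, 'model_L1_0.joblib')  # hypothetical file name
dump(model_L1_1, 'model_L1_1.joblib')  # hypothetical file name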

[3] Create stacked features for the train set (S_train). Then fit the 2nd level model on S_train (and possibly save the model to a file):

from vecstack import stacking

# Note that we compute only OOF predictions for the train set (mode='oof')
S_train, _ = stacking(models_L1,
                      X_train, y_train, None,
                      regression=True,
                      mode='oof',
                      random_state=0,
                      verbose=2)
                           
from sklearn.linear_model import LinearRegression

model_L2 = LinearRegression()
_ = model_L2.fit(S_train, y_train)
# save the model to a file if needed
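If the 2nd level model should also be persisted, the same joblib approach works (the file name is hypothetical):

from joblib import dump

dump(model_L2, 'model_L2.joblib')  # hypothetical file name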

[4] Then a new test set (X_test_new) arrives. We load our 1st level models (if they are not already in memory) and predict on it to get stacked features (S_test_new):

import numpy as np

y_pred_L1_0 = model_L1_0.predict(X_test_new)
y_pred_L1_1 = model_L1_1.predict(X_test_new)
# stack the 1st level predictions column-wise to form 2nd level features
S_test_new = np.c_[y_pred_L1_0, y_pred_L1_1]
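In a fresh session, where the fitted models are no longer in memory, load the saved models back first; a minimal joblib sketch using the hypothetical file names from above:

from joblib import load

model_L1_0 = load('model_L1_0.joblib')
model_L1_1 = load('model_L1_1.joblib')
model_L2 = load('model_L2.joblib')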

[5] Then we load our 2nd level model (if needed) and predict on S_test_new to get the final prediction:

y_pred_new = model_L2.predict(S_test_new)

[6] Each time a new test set arrives, we just repeat [4] and [5].

That's it.

Thanks for the great reply!

Thanks for the reply. If possible, can you clarify how the OOF helps in step 3? For example, when we predict on the X_test_new data it seems that we are just using the previous model from step 2. Are we using the stacking function to make model_L2 more accurate?

We can formulate the whole stacking concept as follows:
Let’s predict X_train and X_test with some 1st level models, and then use these predictions as features for the 2nd level model.
So basically we want the following:

from xgboost import XGBRegressor

model_L1 = XGBRegressor(random_state=0)
_ = model_L1.fit(X_train, y_train)
S_train = model_L1.predict(X_train)  # <- DOES NOT work: overfitted (model has seen X_train)
S_test = model_L1.predict(X_test)    # <- WORKS (model has not seen X_test)

But if we fit on X_train we can’t just predict X_train, because our 1st level model has already seen X_train, and its predictions will be overfitted. To avoid overfitting, we perform a cross-validation procedure, and in each fold we predict the out-of-fold (OOF) part of X_train.
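To make the OOF idea concrete, here is a minimal sketch of the same concept using scikit-learn's cross_val_predict on synthetic data (this illustrates the idea, not the vecstack internals):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

X_train, y_train = make_regression(n_samples=200, n_features=10, random_state=0)
model = RandomForestRegressor(random_state=0)

# Each element of oof_pred comes from a model fitted on folds that exclude
# that row, so these predictions are safe to use as 2nd level features
oof_pred = cross_val_predict(model, X_train, y_train, cv=4)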

For more details, please see the stacking concept tutorial.


how the oof will help in step 3?

OOF helps to avoid overfitting: it is the way to get predictions for the train set from models that did not see those rows during fitting.

For example when we predict on the X_test_new data it seems that we are just using the previous model from step 2.

Yes. It’s the correct implementation, because the 1st level models from step 2 were fitted on X_train and did not see X_test_new.

Are we using the stacking function to ensure the model_L2 be more accurate?

The stacking function wraps the cross-validation procedure and generates out-of-fold (OOF) predictions to avoid overfitting. Without this approach the results would be meaningless.

Thanks for explaining!!

Related note.

A scikit-learn compatible API for stacking was released in version 0.3.0.
Please see the usage guide and the full example.
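For reference, a short sketch of that API (StackingTransformer); treat the exact parameter names as approximate and check the docs:

from vecstack import StackingTransformer
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

estimators = [
    ('et', ExtraTreesRegressor(random_state=0)),
    ('rf', RandomForestRegressor(random_state=0)),
]
stack = StackingTransformer(estimators, regression=True, verbose=0)

# fit_transform on the train set yields OOF features; the fitted transformer
# can then transform any new test set without refitting the 1st level models
S_train = stack.fit_transform(X_train, y_train)
model_L2 = LinearRegression().fit(S_train, y_train)

S_test_new = stack.transform(X_test_new)
y_pred_new = model_L2.predict(S_test_new)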