Using different data transformations and fit parameters for different models

Question

davidolmo opened this issue 6 years ago · comments

Hi Igor,

Congratulations on your package. I've been searching for a stacking package and this one nails it (both for simplicity and effectiveness). Thanks for your contribution.

Is there any possibility to stack already trained models with your package? There are two reasons for this:
- People might want to pass fit arguments to the models (currently not possible, because the stacking function trains the models itself)
- We might want to use different data scaling and preprocessing techniques for different algorithms (label encoding for tree-based methods and one hot for linear)

For example, H2O stacking allows users to stack already trained models:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html

I would love to contribute to your package but unfortunately my technical level would be too dangerous for your code :P

Igor Ivanov · Answer 1 · Wed Feb 28 2018 22:31:20 GMT+0800 (China Standard Time)

Hi David.
Thanks a lot! I'm happy that you like this tool.

> Is there any possibility to stack already trained models with your package?

Quote from H2O:

Before training a stacked ensemble, you will need to train and cross-validate a set of "base models" which will make up the ensemble.

The stacking function from my package performs exactly these actions: it trains and cross-validates a set of "base models". As a result it returns features (predictions from the base models), so once the call completes you have everything you need for the 2nd level. This means you don't need to train base models separately, and often you don't need already trained models at all.
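To make the idea concrete, here is a minimal sketch of how out-of-fold features can be produced. It is not vecstack's implementation: `TinyRidge` and `oof_features` are hypothetical names invented for this illustration, it uses only NumPy, and averaging the per-fold test predictions is just one common choice (what vecstack actually does depends on its settings).

```python
import numpy as np


class TinyRidge:
    """Hypothetical stand-in for a base model (closed-form ridge regression)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Add a bias column, then solve (A^T A + alpha * I) w = A^T y.
        # For simplicity the bias term is regularized as well.
        A = np.c_[np.ones(len(X)), X]
        reg = self.alpha * np.eye(A.shape[1])
        self.w_ = np.linalg.solve(A.T @ A + reg, A.T @ y)
        return self

    def predict(self, X):
        return np.c_[np.ones(len(X)), X] @ self.w_


def oof_features(model_factories, X_train, y_train, X_test, n_folds=4, seed=0):
    """Return (S_train, S_test) with one column of predictions per base model."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X_train)), n_folds)
    S_train = np.zeros((len(X_train), len(model_factories)))
    S_test = np.zeros((len(X_test), len(model_factories)))
    for m, make_model in enumerate(model_factories):
        test_preds = np.zeros((len(X_test), n_folds))
        for k, val_idx in enumerate(folds):
            tr_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
            model = make_model().fit(X_train[tr_idx], y_train[tr_idx])
            # Each training row is predicted by a model that never saw it
            S_train[val_idx, m] = model.predict(X_train[val_idx])
            test_preds[:, k] = model.predict(X_test)
        # Average the test predictions made by the per-fold models
        S_test[:, m] = test_preds.mean(axis=1)
    return S_train, S_test


# Small synthetic demo data
rng = np.random.RandomState(0)
X_train, X_test = rng.rand(40, 3), rng.rand(10, 3)
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.rand(40)

S_train, S_test = oof_features(
    [lambda: TinyRidge(alpha=0.1), lambda: TinyRidge(alpha=10.0)],
    X_train, y_train, X_test)
print(S_train.shape, S_test.shape)
```

The two returned arrays play the same role as the features returned by the stacking function: `S_train` is used to fit the 2nd level model and `S_test` to predict with it.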

If you want to reuse trained models to predict future data, please see this thread for details.

If you want to use different training data or different hyperparameters for base models, you just need to call the stacking function several times and combine the resulting features. Please see the example below.

> We might want to use different data scaling and preprocessing techniques for different algorithms (label encoding for tree-based methods and one hot for linear)

It's easy to do.

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from xgboost import XGBRegressor
from vecstack import stacking

# Create demo data (all features are categorical)
np.random.seed(0)
X_train = np.random.randint(5, 10, 404*13).reshape(404, 13)
X_test = np.random.randint(5, 10, 102*13).reshape(102, 13)
y_train = np.random.rand(404)
y_test = np.random.rand(102)

# -----------------------------------------------------------------------------
# Perform stacking with label encoded data (tree algorithms)
# -----------------------------------------------------------------------------

le = LabelEncoder()

# Create empty arrays for label encoded data
X_train_le = np.zeros_like(X_train)
X_test_le = np.zeros_like(X_test)

# Encode with labels
# Assuming all columns are categorical
for i in range(X_train.shape[1]):
    X_train_le[:, i] = le.fit_transform(X_train[:, i])
    X_test_le[:, i] = le.transform(X_test[:, i])

models_le = [RandomForestRegressor(random_state=0),
             ExtraTreesRegressor(random_state=0)]
             
S_train_le, S_test_le = stacking(models_le, X_train_le, 
                                 y_train, X_test_le, verbose=2)

# -----------------------------------------------------------------------------
# Perform stacking with one-hot encoded data (linear algorithms)
# -----------------------------------------------------------------------------

ohe = OneHotEncoder()
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)

models_ohe = [LinearRegression(),
              Ridge(random_state=0)]
              
S_train_ohe, S_test_ohe = stacking(models_ohe, X_train_ohe, 
                                   y_train, X_test_ohe, verbose=2)

# -----------------------------------------------------------------------------
# Combine 1st level features
# -----------------------------------------------------------------------------

S_train_final = np.c_[S_train_le, S_train_ohe]
S_test_final = np.c_[S_test_le, S_test_ohe]

# -----------------------------------------------------------------------------
# Fit 2nd level model and get final prediction
# -----------------------------------------------------------------------------

model_L2 = XGBRegressor(random_state=0)
_ = model_L2.fit(S_train_final, y_train)
y_pred_final = model_L2.predict(S_test_final)

If my answer is not complete please ask again with some code example specific to your task.

Dadv · Answer 2 · Thu Mar 01 2018 05:51:32 GMT+0800 (China Standard Time)

Your answer is super complete, you nailed it again! Thank you so much for your prompt response. I will do exactly as you explained.

One more question: is there any way to pass fit arguments of the models to the stacking function? For example, the LightGBM API for sklearn has the "categorical_feature" parameter in the fit function:

https://lightgbm.readthedocs.io/en/latest/Python-API.html#lightgbm.LGBMRegressor.fit

You included the "sample_weight" fit argument in the stacking function, which is very nice. What about the rest (such as categorical features)?

Thank you Igor

Igor Ivanov · Answer 3 · Fri Mar 02 2018 21:02:59 GMT+0800 (China Standard Time)

At the moment fit arguments are not supported in the stable version.
But of course your suggestion is reasonable. I will probably add this in the future.

That said, you can use fit arguments right now :)
I built a quick patch which gives you limited support for fit arguments. The Git branch is called fit_args. You can look at the patch in this commit. The limitation is that you can use only one 1st level algorithm in the models list for each call of the stacking function, because fit arguments are passed to all algorithms in the list and different algorithms may have different fit arguments. If you do not pass fit arguments, you can use any number of algorithms in the models list as usual.

You can install the patched version directly from the GitHub branch:

pip3 install --user --upgrade --force-reinstall --no-deps https://github.com/vecxoz/vecstack/archive/fit_args.zip

You can return to the stable version anytime you want:

pip3 install --user --upgrade --force-reinstall --no-deps vecstack

So now you can try to run the following example with fit arguments:

import numpy as np
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from vecstack import stacking

# Create demo data (all features are categorical)
np.random.seed(0)
X_train = np.random.randint(5, 10, 404*13).reshape(404, 13)
X_test = np.random.randint(5, 10, 102*13).reshape(102, 13)
y_train = np.random.rand(404)
y_test = np.random.rand(102)

# Only one algorithm for each call of stacking function
models_lgbm = [LGBMRegressor(random_state=0, n_estimators=10, min_child_samples=2)]

S_train_lgbm, S_test_lgbm = stacking(models_lgbm, 
                                     X_train, y_train, X_test, 
                                     verbose=2, 
                                     # fit arguments for LGBMRegressor ONLY
                                     categorical_feature='auto')

# Only one algorithm for each call of stacking function
models_xgb = [XGBRegressor(random_state=0, n_estimators=10)]

S_train_xgb, S_test_xgb = stacking(models_xgb, 
                                   X_train, y_train, X_test, 
                                   verbose=2, 
                                   # fit arguments for XGBRegressor ONLY
                                   # ...
                                   )
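A possible alternative to the patched branch is to bind the fit arguments to each model before it goes into the models list, so that a plain fit(X, y) call picks them up. The sketch below only illustrates that idea: `WithFitArgs` and `EchoModel` are hypothetical names, and nothing here has been tested against vecstack. In particular, whether the stacking function accepts such a wrapper is an assumption, since code that clones estimators the scikit-learn way may expect methods like get_params that this wrapper does not provide.

```python
class WithFitArgs:
    """Hypothetical wrapper that binds keyword arguments to a model's fit() call."""

    def __init__(self, model, **fit_kwargs):
        self.model = model
        self.fit_kwargs = fit_kwargs

    def fit(self, X, y, **extra):
        # Arguments given at call time override the stored ones
        self.model.fit(X, y, **{**self.fit_kwargs, **extra})
        return self

    def predict(self, X):
        return self.model.predict(X)


class EchoModel:
    """Toy model used only to show which fit arguments arrive."""

    def fit(self, X, y, **kwargs):
        self.seen_fit_kwargs_ = kwargs
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Always predicts the mean of the training targets
        return [self.mean_ for _ in X]


wrapped = WithFitArgs(EchoModel(), categorical_feature='auto')
wrapped.fit([[1], [2], [3]], [1.0, 2.0, 3.0])
print(wrapped.model.seen_fit_kwargs_)  # {'categorical_feature': 'auto'}
print(wrapped.predict([[4]]))          # [2.0]
```

With this pattern each model carries its own fit arguments, so in principle several differently configured models could share one list. Treat that as something to verify rather than a guarantee.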

Dadv · Answer 4 · Mon Mar 05 2018 00:35:39 GMT+0800 (China Standard Time)

Oh this is perfect. Even if the fit arguments can only be used for one model per call, I understand that by following the example in your previous answer I can easily combine the stacked prediction frames. It seems like I have everything needed to use this in a production pipeline :-)

Thank you, by the way, for your detailed explanation of how to install directly from the GitHub branch.

You should definitely add this to the description of the package. I'm pretty sure that a lot of people wonder how to combine models with different fit parameters and different data transformations. With these tricks, this package does everything that is needed!

Thanks Igor