pipeline refit/partial_fit
ryan102590 opened this issue · comments
Is there a way of doing a partial_fit or refit in the sklearn pipeline api for incremental learning?
Best regards
I actually tried the pipeline trick you referenced, and it works pretty well as long as every step has a partial_fit method. Any ideas on how we could implement this in vecstack?
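For readers without the referenced link, here is a minimal sketch of what such a pipeline trick might look like. scikit-learn's `Pipeline` has no `partial_fit` of its own, so the hypothetical helper `pipeline_partial_fit` walks the steps by hand. It assumes every step implements `partial_fit` (here `StandardScaler` and `SGDRegressor`, chosen only for illustration), and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def pipeline_partial_fit(pipeline, X, y):
    """Hypothetical helper: call partial_fit on every step of a Pipeline."""
    Xt = X
    # Update each transformer incrementally, then transform the batch
    for _, transformer in pipeline.steps[:-1]:
        transformer.partial_fit(Xt)
        Xt = transformer.transform(Xt)
    # Update the final estimator on the transformed batch
    pipeline.steps[-1][1].partial_fit(Xt, y)
    return pipeline


# Synthetic data split into two batches (illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("scale", StandardScaler()),
                 ("sgd", SGDRegressor(random_state=0))])
for start in (0, 100):
    pipeline_partial_fit(pipe, X[start:start + 100], y[start:start + 100])

pred = pipe.predict(X[:5])
```

Note that the scaler statistics change between batches in this sketch, so earlier batches were transformed with slightly different parameters than later ones.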
Right now, I'm calling partial_fit on the models that support it (MLPRegressor, Mondrian forests/trees, etc.) and combining the new predictions with the original stacked prediction through a weighted sum with user-supplied weights. That works well, but I'd prefer a partial_fit stack that takes all of the previous data into account.
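The weighted summing described here could be sketched roughly as follows. The helper name `blend_predictions` is hypothetical, and treating the blend as a convex combination with a single weight in [0, 1] is an assumption; the actual weighting scheme is not spelled out in this thread.

```python
import numpy as np


def blend_predictions(stacked_pred, incremental_pred, weight):
    """Hypothetical helper: convex combination of two prediction arrays.

    weight is the user-chosen share given to the incremental predictions.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    stacked_pred = np.asarray(stacked_pred, dtype=float)
    incremental_pred = np.asarray(incremental_pred, dtype=float)
    return weight * incremental_pred + (1.0 - weight) * stacked_pred


# Made-up predictions for two samples
stacked = np.array([10.0, 20.0])
incremental = np.array([14.0, 16.0])
blended = blend_predictions(stacked, incremental, weight=0.25)
# blended is [11.0, 19.0]
```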
The concept of partial_fit for stacking is interesting, but this functionality has a limited number of use cases. Currently I don't plan to implement partial_fit for vecstack.
But if you really need this functionality, you can try some custom modifications. Please tell me what your task is (regression, classification with labels, or classification with probabilities) and I will draft a corresponding code snippet for you.
The task I'm working on is multioutput regression. For each output, I'm trying to do an initial pre-testing prediction with an ensemble of fitted and partial_fitted stacks. Then, during testing, the partially fitted stacks are incrementally refitted and weighted against the pre-testing predictions in a superlearner. My hope is to make the results more robust to new data and to take new information into account as it becomes available. The superlearner seems to give me the most trouble, although it also gets much more accurate near the middle of testing. I'm currently using the MLXtend version for the partial_fit, but it does strange things to the predictions that I don't experience with vecstack, even when not using partial_fit.
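As a point of comparison for the multioutput case, scikit-learn's `MultiOutputRegressor` exposes `partial_fit` when the wrapped estimator has one, fitting one copy of the estimator per output column. The sketch below is not the setup described above, only a minimal illustration on synthetic data with `SGDRegressor` as an assumed base estimator.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data with two output columns (illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
W = rng.normal(size=(4, 2))
Y = X @ W + rng.normal(scale=0.1, size=(300, 2))

# One SGDRegressor is kept per output and updated batch by batch
model = MultiOutputRegressor(SGDRegressor(random_state=0))
for start in range(0, 300, 100):
    model.partial_fit(X[start:start + 100], Y[start:start + 100])

Y_pred = model.predict(X[:10])
```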
I drafted a compact version of the stacking procedure with partial_fit for a regression task. You can start from this example and modify it for your specific needs.
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.base import clone
def stacking_partial(fold_models, X_train, y_train, X_test,
                     n_folds=4, shuffle=False, random_state=0):
    """
    Example function to perform stacking with ``partial_fit``.
    Task: regression.

    Parameters
    ----------
    fold_models : list
        List of models for each fold. Can be created with the following code:
        ``[clone(estimator) for i in range(n_folds)]``
    """
    assert len(fold_models) == n_folds
    # Create empty arrays to store OOF predictions created in each fold
    S_train = np.zeros((X_train.shape[0], 1))
    S_test_temp = np.zeros((X_test.shape[0], n_folds))
    # Init CV. random_state is passed only when shuffle=True, because
    # recent scikit-learn versions raise ValueError otherwise
    kf = KFold(n_splits=n_folds, shuffle=shuffle,
               random_state=random_state if shuffle else None)
    # Scores from each fold
    scores = []
    # Loop across folds
    for fold_counter, (tr_index, te_index) in enumerate(kf.split(X_train, y_train)):
        # Split data and target
        X_tr = X_train[tr_index]
        y_tr = y_train[tr_index]
        X_te = X_train[te_index]
        y_te = y_train[te_index]
        # Get model for current fold
        model = fold_models[fold_counter]
        # Partial fit
        _ = model.partial_fit(X_tr, y_tr)
        # Predict OOF part of train set
        S_train[te_index, :] = model.predict(X_te).reshape(-1, 1)
        # Predict test set
        S_test_temp[:, fold_counter] = model.predict(X_test)
        # Compute and print OOF score of current fold
        score = mean_absolute_error(y_te, S_train[te_index, :])
        scores.append(score)
        print('fold %d: [%.8f]' % (fold_counter, score))
    # Compute mean of temporary test set preds to get final test set preds
    S_test = np.mean(S_test_temp, axis=1).reshape(-1, 1)
    # Mean OOF score + std
    print('----')
    print('MEAN: [%.8f] + [%.8f]' % (np.mean(scores), np.std(scores)))
    return (S_train, S_test)
#------------------------------------------------------------------------------
# Example usage
#------------------------------------------------------------------------------
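# Load and scale data
# (load_boston was removed in scikit-learn 1.2, so this part needs an older
# version or a different regression dataset)
boston = load_boston()
X, y = boston.data, boston.target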
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Create 2 batches of train data
# (in practice we can have arbitrary number of batches)
X_train_batch_1 = X[:200]
y_train_batch_1 = y[:200]
X_train_batch_2 = X[200:400]
y_train_batch_2 = y[200:400]
# Create test data
X_test = X[400:]
y_test = y[400:]
# Number of folds
n_folds = 4
# Estimator
estimator = MLPRegressor(random_state=0,
                         hidden_layer_sizes=(200, ),
                         learning_rate_init=0.1)
# Clone estimator n_fold times in order to have independent model for each fold
fold_models = [clone(estimator) for i in range(n_folds)]
# Partial fit on 1st batch
S_train_1, S_test_1 = stacking_partial(fold_models,
                                       X_train_batch_1,
                                       y_train_batch_1,
                                       X_test, n_folds)
# Partial fit on 2nd batch. Test data is the same here but it could be different
S_train_2, S_test_2 = stacking_partial(fold_models,
                                       X_train_batch_2,
                                       y_train_batch_2,
                                       X_test, n_folds)
# Partial fit on Nth batch
# ...
# When we want to start from the beginning we need to reclone estimator
# fold_models = [clone(estimator) for i in range(n_folds)]
Scores on 1st batch:
fold 0: [13.31348141]
fold 1: [16.11789703]
fold 2: [14.18695878]
fold 3: [22.71476435]
----
MEAN: [16.58327539] + [3.68258129]
Scores on 2nd batch:
fold 0: [6.81095776]
fold 1: [6.82251109]
fold 2: [8.15176789]
fold 3: [11.32682286]
----
MEAN: [8.27801490] + [1.84268257]
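The function above only produces the stacked features (S_train and S_test) for each batch; a second-level model still has to be trained on them. A minimal sketch of doing that incrementally is shown below. It assumes `SGDRegressor` as the second-level model, which is not something proposed in this thread, and uses the hypothetical helper `fake_oof_batch` to generate synthetic stand-ins for the arrays returned by stacking_partial.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)


def fake_oof_batch(n_samples):
    """Hypothetical stand-in for one (S_train, y_train) pair."""
    y = rng.normal(size=n_samples)
    # Pretend the first-level OOF predictions are the target plus noise
    S = (y + rng.normal(scale=0.2, size=n_samples)).reshape(-1, 1)
    return S, y


# Second-level model, updated once per incoming batch
meta_model = SGDRegressor(random_state=0)
for _ in range(2):
    S_train, y_train = fake_oof_batch(200)
    meta_model.partial_fit(S_train, y_train)

# Synthetic stand-in for S_test: one column of first-level predictions
S_test = np.array([[-1.0], [1.0]])
final_pred = meta_model.predict(S_test)
```

The synthetic values are kept on a unit scale because SGD-based models are sensitive to feature scaling; real OOF predictions may need scaling first.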