pipeline refit/partial_fit

ryan102590 opened this issue 5 years ago

ryan102590 commented 5 years ago

Is there a way to do a partial_fit or refit through the sklearn Pipeline API for incremental learning?

Best regards

Igor Ivanov · Answer 1 · Thu Dec 20 2018 19:43:19 GMT+0800 (China Standard Time)

Unfortunately no.

sklearn.pipeline.Pipeline does not support partial_fit. Whether this functionality makes sense is questionable once transformers are involved. See these discussions: 3299 and 11321.

vecstack.StackingTransformer does not support partial_fit either.
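For illustration only, here is a minimal sketch of chaining partial_fit by hand outside of Pipeline. It assumes that every step exposes partial_fit, which holds for StandardScaler and SGDRegressor; the helper name partial_fit_chain and the synthetic data are hypothetical. One likely reason the approach is debatable with transformers is visible here: each batch is scaled with the statistics seen so far, so earlier batches were transformed differently from later ones.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler


def partial_fit_chain(scaler, model, X_batch, y_batch):
    """Update the scaler, then the model, on one batch (hypothetical helper)."""
    scaler.partial_fit(X_batch)            # update running mean and variance
    X_scaled = scaler.transform(X_batch)   # scale with the statistics seen so far
    model.partial_fit(X_scaled, y_batch)   # one incremental update of the model
    return scaler, model


# Synthetic regression data, used only to make the sketch runnable
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

scaler = StandardScaler()
model = SGDRegressor(random_state=0)

# Feed the data in three batches of 100 rows
for start in range(0, 300, 100):
    partial_fit_chain(scaler, model, X[start:start + 100], y[start:start + 100])

preds = model.predict(scaler.transform(X[:5]))
```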

ryan102590 · Answer 2 · Sat Dec 22 2018 05:31:10 GMT+0800 (China Standard Time)

I actually tried the pipeline trick you referenced, and it works pretty well as long as every step defines partial_fit. Any ideas on how we could try to implement this in vecstack?

Right now, I'm calling partial_fit on the models that support it (MLPRegressor, Mondrian forests/trees, etc.) and then computing a weighted sum of the new predictions and the original stacked prediction, with user-supplied weights. This works well, but I'd prefer a stack with partial_fit that reflects all of the previous data.
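The weighted summing described above could look roughly like the sketch below. The function name blend_predictions and the weight_new parameter are hypothetical, and a single scalar weight is an assumption; the thread does not say how the weights are actually chosen.

```python
import numpy as np


def blend_predictions(original_pred, updated_pred, weight_new=0.5):
    """Weighted sum of the original stacked prediction and the prediction
    from partially fitted models (hypothetical helper).

    weight_new is the user-supplied weight given to the updated prediction.
    """
    if not 0.0 <= weight_new <= 1.0:
        raise ValueError("weight_new must be in [0, 1]")
    original_pred = np.asarray(original_pred, dtype=float)
    updated_pred = np.asarray(updated_pred, dtype=float)
    return (1.0 - weight_new) * original_pred + weight_new * updated_pred


# 0.75 * [10, 20] + 0.25 * [14, 16] = [11, 19]
blended = blend_predictions([10.0, 20.0], [14.0, 16.0], weight_new=0.25)
```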

Igor Ivanov · Answer 3 · Wed Dec 26 2018 19:21:03 GMT+0800 (China Standard Time)

The concept of partial_fit for stacking is interesting, but this functionality has a limited number of use cases. Currently I don’t plan to implement partial_fit for vecstack.

But if you really need this functionality, you can try some custom modifications. Please tell me what your task is (regression, classification with labels, or classification with probabilities) and I will draft a corresponding code snippet for you.

ryan102590 · Answer 4 · Thu Jan 03 2019 07:11:27 GMT+0800 (China Standard Time)

The task I'm working on is multioutput regression. For each output I'm trying to make an initial pre-testing prediction with an ensemble of fitted and partially fitted stacks. Then, during testing, the partially fitted stacks are incrementally refitted and weighted against the pre-testing predictions in a superlearner. My hope is to make the results more robust to new data and to let new information be taken into account as it becomes available. The superlearner seems to be giving me the most trouble, although it also becomes much more accurate near the middle of testing. I'm currently using the MLXtend version for the partial_fit, but it does strange things to the predictions that I don't see with vecstack, even when partial_fit is not used.
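As a small aside on the multioutput part: MLPRegressor accepts a 2-D target in partial_fit, so a single model can be updated incrementally for all outputs at once. The sketch below uses synthetic data and arbitrary settings, and is not taken from the setup described above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data with two targets, used only to make the sketch runnable
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

model = MLPRegressor(hidden_layer_sizes=(20,), random_state=0)

# Incrementally update the same model on four batches of 50 rows
for start in range(0, 200, 50):
    model.partial_fit(X[start:start + 50], Y[start:start + 50])

# One column of predictions per output
Y_pred = model.predict(X[:10])
```

In a stacking setup, this would mean storing one column of out-of-fold predictions per output rather than a single column.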

Igor Ivanov · Answer 5 · Sun Jan 06 2019 19:58:00 GMT+0800 (China Standard Time)

I drafted a compact version of the stacking procedure with partial_fit for a regression task. You can start from this example and modify it for your specific needs.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.base import clone

def stacking_partial(fold_models, X_train, y_train, X_test, 
                     n_folds=4, shuffle=False, random_state=0):
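    """
    Example function to perform stacking with ``partial_fit``.
    Task: regression.

    Parameters
    ----------
    fold_models : list
        List of models, one per fold. Can be created with the following code:
        ``[clone(estimator) for i in range(n_folds)]``
    X_train, y_train : numpy arrays
        Current batch of training data and target.
    X_test : numpy array
        Test data.
    n_folds, shuffle, random_state
        Passed to ``KFold``.

    Returns
    -------
    S_train, S_test : 2d numpy arrays
        OOF predictions for the batch and fold-averaged test predictions.
    """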
    assert len(fold_models) == n_folds

    # Create empty arrays to store OOF created in each fold
    S_train = np.zeros((X_train.shape[0], 1))
    S_test_temp = np.zeros((X_test.shape[0], n_folds))

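    # Init CV (recent scikit-learn raises ValueError if random_state is set
    # while shuffle=False, so pass it only when shuffling)
    kf = KFold(n_splits=n_folds, shuffle=shuffle,
               random_state=random_state if shuffle else None)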

    # Scores from each fold
    scores = []

    # Loop across folds
    for fold_counter, (tr_index, te_index) in enumerate(kf.split(X_train, y_train)):
        
        # Split data and target
        X_tr = X_train[tr_index]
        y_tr = y_train[tr_index]
        X_te = X_train[te_index]
        y_te = y_train[te_index]
    
        # Get model for current fold
        model = fold_models[fold_counter]
        
        # Partial fit
        _ = model.partial_fit(X_tr, y_tr)
        
        # Predict OOF part of train set
        S_train[te_index, :] = model.predict(X_te).reshape(-1, 1)
        
        # Predict test set
        S_test_temp[:, fold_counter] = model.predict(X_test)
        
        # Compute and print OOF score of current fold
        score = mean_absolute_error(y_te, S_train[te_index, :])
        scores.append(score)
        print('fold %d: [%.8f]' % (fold_counter, score))
        
    # Compute mean of temporary test set preds to get final test set preds
    S_test = np.mean(S_test_temp, axis=1).reshape(-1, 1)
    
    # Mean OOF score + std
    print('----')
    print('MEAN:   [%.8f] + [%.8f]' % (np.mean(scores), np.std(scores)))

    return (S_train, S_test)

#------------------------------------------------------------------------------
# Example usage
#------------------------------------------------------------------------------

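# Load and scale data
# Note: load_boston was removed in scikit-learn 1.2, so this example
# needs an older release (or a substitute regression dataset)
boston = load_boston()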
X, y = boston.data, boston.target
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Create 2 batches of train data 
# (in practice we can have arbitrary number of batches)
X_train_batch_1 = X[:200]
y_train_batch_1 = y[:200]
X_train_batch_2 = X[200:400]
y_train_batch_2 = y[200:400]

# Create test data
X_test = X[400:]
y_test = y[400:]

# Number of folds
n_folds = 4
# Estimator
estimator = MLPRegressor(random_state=0, 
                         hidden_layer_sizes=(200, ), 
                         learning_rate_init=0.1)
# Clone estimator n_folds times in order to have an independent model for each fold
fold_models = [clone(estimator) for i in range(n_folds)]

# Partial fit on 1st batch
S_train_1, S_test_1 = stacking_partial(fold_models, 
                                       X_train_batch_1, 
                                       y_train_batch_1, 
                                       X_test, n_folds)
# Partial fit on 2nd batch. Test data is the same here but it could be different
S_train_2, S_test_2 = stacking_partial(fold_models, 
                                       X_train_batch_2, 
                                       y_train_batch_2, 
                                       X_test, n_folds)
# Partial fit on Nth batch
# ...

# When we want to start from the beginning we need to reclone estimator
# fold_models = [clone(estimator) for i in range(n_folds)]

Scores on 1st batch:

fold 0: [13.31348141]
fold 1: [16.11789703]
fold 2: [14.18695878]
fold 3: [22.71476435]
----
MEAN:   [16.58327539] + [3.68258129]

Scores on 2nd batch:

fold 0: [6.81095776]
fold 1: [6.82251109]
fold 2: [8.15176789]
fold 3: [11.32682286]
----
MEAN:   [8.27801490] + [1.84268257]
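The function above only produces the first-level features (S_train, S_test). A possible next step, not shown in the thread, is to update a second-level model incrementally on those features batch by batch. The sketch below assumes SGDRegressor as the second-level model and uses synthetic stand-ins for the OOF arrays so that it runs on its own; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)


def fake_oof(n_rows):
    """Synthetic stand-in for one (S_train, y_train) batch: the single
    OOF column is a noisy copy of the target."""
    y = rng.normal(size=n_rows)
    S = (y + rng.normal(scale=0.1, size=n_rows)).reshape(-1, 1)
    return S, y


S_train_1, y_train_batch_1 = fake_oof(200)
S_train_2, y_train_batch_2 = fake_oof(200)
S_test, _ = fake_oof(100)

# Second-level model, updated once per batch of first-level features
meta_model = SGDRegressor(random_state=0)
meta_model.partial_fit(S_train_1, y_train_batch_1)
meta_model.partial_fit(S_train_2, y_train_batch_2)

# Final predictions from the first-level test features
final_pred = meta_model.predict(S_test)
```

With the real arrays, S_test would be the one returned by the most recent stacking_partial call, since it reflects the latest state of the fold models.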