How to combine early stopping?
ZeroAlcoholic opened this issue · comments
Thanks for your contribution. I was looking for a good API for stacking and found your package.
I am wondering whether it is possible to combine `early_stopping` in LightGBM or `EarlyStopping` in Keras with vecstack (because I don't know how to do it)?
EDIT
The `predict` call inside the class definition was modified.
Before: `num_iteration=super(WrapLGB, self).best_iteration_`
Now: `num_iteration=self.best_iteration_`
See the explanation in the comment below.
Yes, it's possible. This task is accomplished by passing the estimator's `fit` and `predict` arguments through a user-defined class wrapper.
You should remember that the stacking procedure performs cross-validation inside. So if you initialize `StackingTransformer` with 4 folds, like so: `stack = StackingTransformer(n_folds=4)`, it means that when you call `stack.fit(X_train, y_train)` you actually fit 4 models, each on 3/4 of `X_train`. Now you want to perform early stopping for each of these 4 models, and you need a validation set to compute scores. Remember that you can NOT use the out-of-fold part (1/4 of `X_train`) for early stopping, because in each fold you predict this part, and you can NOT touch it to avoid overfitting.
To get a validation set you have two options:
- You can use the same fixed validation set for each of the 4 folds. You should prepare this set beforehand.
- You can generate a new validation set in each fold, e.g. 1/5 of the current fold's training data. It means that you will actually train on (4/5) of (3/4) of `X_train` (i.e. 12/20 of `X_train`) and perform early stopping on (1/5) of (3/4) of `X_train` (i.e. 3/20 of `X_train`). Just a reminder: the out-of-fold part (which you can NOT touch) is 1/4 of `X_train` (i.e. 5/20 of `X_train`). See example below.
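The split arithmetic above can be double-checked with a quick sketch (plain Python, my own illustration, not part of the original answer):

```python
from fractions import Fraction

fold_train = Fraction(3, 4)   # in-fold training part with n_folds=4
oof = 1 - fold_train          # out-of-fold part: predicted, never touched

inner_train = Fraction(4, 5) * fold_train  # data actually used for training
inner_val = Fraction(1, 5) * fold_train    # data used for early stopping

print(inner_train)  # 3/5, i.e. 12/20 of X_train
print(inner_val)    # 3/20 of X_train
print(oof)          # 1/4, i.e. 5/20 of X_train

# in each fold the three parts cover all of X_train
assert inner_train + inner_val + oof == 1
```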
Option 2. Complete example
```python
# Set up regression problem
import numpy as np
np.random.seed(42)

from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor
from vecstack import StackingTransformer

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data,
                                                    boston.target,
                                                    test_size=0.2,
                                                    random_state=42)

#----------------------------------------------------------
# User-defined class wrapper

class WrapLGB(LGBMRegressor):
    """This is a template for a user-defined class wrapper.
    Use this template to pass any ``fit`` and ``predict`` arguments.
    """
    def fit(self, X, y):
        # Carve out a validation set from the current fold's training data
        X_tr, X_val, y_tr, y_val = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
        return super(WrapLGB, self).fit(X_tr, y_tr,
                                        early_stopping_rounds=5,
                                        eval_set=[(X_val, y_val)],
                                        eval_metric='l2', verbose=1)

    def predict(self, X):
        # Predict with the number of trees found by early stopping
        return super(WrapLGB, self).predict(X,
                                            num_iteration=self.best_iteration_)

#----------------------------------------------------------
# Initialize StackingTransformer

estimators = [('wraplgb', WrapLGB(learning_rate=0.9,
                                  n_estimators=1000,
                                  random_state=42))]

stack = StackingTransformer(estimators, regression=True,
                            n_folds=4, metric=mse)

# Fit and transform
stack = stack.fit(X_train, y_train)
S_train = stack.transform(X_train)
S_test = stack.transform(X_test)
```
Output
I put the raw output here for demonstration. You can see that early stopping was performed in each of the 4 folds:
```
[1] valid_0's l2: 32.0246
Training until validation scores don't improve for 5 rounds.
[2] valid_0's l2: 23.464
[3] valid_0's l2: 22.2144
[4] valid_0's l2: 19.8271
[5] valid_0's l2: 22.7295
[6] valid_0's l2: 21.3527
[7] valid_0's l2: 22.6876
[8] valid_0's l2: 22.4059
[9] valid_0's l2: 21.4023
Early stopping, best iteration is:
[4] valid_0's l2: 19.8271
[1] valid_0's l2: 22.6718
Training until validation scores don't improve for 5 rounds.
[2] valid_0's l2: 22.0576
[3] valid_0's l2: 20.7717
[4] valid_0's l2: 21.4487
[5] valid_0's l2: 20.7593
[6] valid_0's l2: 19.9866
[7] valid_0's l2: 20.8062
[8] valid_0's l2: 20.8037
[9] valid_0's l2: 20.7226
[10] valid_0's l2: 20.7807
[11] valid_0's l2: 22.9261
Early stopping, best iteration is:
[6] valid_0's l2: 19.9866
[1] valid_0's l2: 36.1314
Training until validation scores don't improve for 5 rounds.
[2] valid_0's l2: 24.1133
[3] valid_0's l2: 17.6557
[4] valid_0's l2: 20.1154
[5] valid_0's l2: 20.1621
[6] valid_0's l2: 19.742
[7] valid_0's l2: 18.1264
[8] valid_0's l2: 17.9662
Early stopping, best iteration is:
[3] valid_0's l2: 17.6557
[1] valid_0's l2: 32.7848
Training until validation scores don't improve for 5 rounds.
[2] valid_0's l2: 26.3399
[3] valid_0's l2: 27.7075
[4] valid_0's l2: 25.7245
[5] valid_0's l2: 24.1551
[6] valid_0's l2: 22.0104
[7] valid_0's l2: 19.5018
[8] valid_0's l2: 19.4044
[9] valid_0's l2: 19.7235
[10] valid_0's l2: 19.9468
[11] valid_0's l2: 19.242
[12] valid_0's l2: 18.8428
[13] valid_0's l2: 19.4026
[14] valid_0's l2: 19.7783
[15] valid_0's l2: 20.3338
[16] valid_0's l2: 20.4569
[17] valid_0's l2: 20.5523
Early stopping, best iteration is:
[12] valid_0's l2: 18.8428
```
Please pay attention.
I made a little but important modification in the `predict` call inside the class definition in the previous comment.
Before: `num_iteration=super(WrapLGB, self).best_iteration_`
Now: `num_iteration=self.best_iteration_`
In this case both variants work identically, because `best_iteration_` is a property. But we should remember that `super(WrapLGB, self).best_iteration_` works only if `best_iteration_` is a property, whereas `self.best_iteration_` always works (it does not matter whether `best_iteration_` is a property or just a data attribute (class field)).
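The difference can be shown with a tiny example (my own illustration, unrelated to LightGBM):

```python
class Base(object):
    def __init__(self):
        self.plain = 1          # ordinary data attribute: lives on the instance

    @property
    def prop(self):             # property: lives on the class, so super() finds it
        return 2

class Child(Base):
    def read(self):
        # self.<attr> always works, for both kinds of attribute
        a = self.plain
        b = self.prop
        # super().<attr> only works for the property: attribute lookup on
        # the super() proxy goes through the class, not the instance dict
        c = super(Child, self).prop
        try:
            d = super(Child, self).plain
        except AttributeError:
            d = None            # plain data attribute is not found via super()
        return a, b, c, d

print(Child().read())  # (1, 2, 2, None)
```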
Thank you. I am thinking about the 'can NOT touch' part... I will try it. Thanks.