vecxoz / vecstack

Python package for stacking (machine learning technique)

high variability of StackingTransformer on training data

KerenHalperin opened this issue

Hi, I was wondering if you could help. I am using a blended model built on StackingTransformer with 4 base models. For some reason, the features I create for the model in production come out slightly different, on the order of 1e-7, and this makes the prediction results very different. I have set random_state for the data split, for the base models, and for the StackingTransformer. Do you have any suggestions as to why this variability is happening and how to reduce it?
Thanks in any case!
Keren

Hi, thanks for using vecstack.

StackingTransformer does not introduce any randomness other than the cross-validation split, which is controlled by random_state. So if your predicted values differ slightly, the effect is probably coming from your models. If a model produces identical predictions on repeated runs, StackingTransformer will also give you identical output. I would recommend checking each model without StackingTransformer, i.e. training and predicting with each model separately, to find out whether its predictions are identical. Each model may have its own stochastic components and corresponding random states.
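One way to run that per-model check is sketched below. It is only an illustration: the helper `is_reproducible`, the `models` dictionary, and the two scikit-learn estimators are placeholders I picked for the example, not the base models from the original question. Each model is cloned and fitted twice on the same data, and the two sets of predicted probabilities are compared.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def is_reproducible(model, X_fit, y_fit, X_new):
    """Fit two fresh clones of `model` and compare their predictions.

    Returns (identical, max_abs_diff). Hypothetical helper for illustration.
    """
    p1 = clone(model).fit(X_fit, y_fit).predict_proba(X_new)
    p2 = clone(model).fit(X_fit, y_fit).predict_proba(X_new)
    return bool(np.array_equal(p1, p2)), float(np.abs(p1 - p2).max())


# Toy data standing in for the real training set
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, n_redundant=1,
                           random_state=0)
X_fit, y_fit, X_new = X[:200], y[:200], X[200:]

# Placeholder base models; substitute your own four estimators here
models = {
    'rf': RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0),
    'logreg': LogisticRegression(max_iter=1000),
}

results = {name: is_reproducible(model, X_fit, y_fit, X_new)
           for name, model in models.items()}

for name, (identical, max_diff) in results.items():
    print(name, 'identical:', identical, 'max abs diff:', max_diff)
```

Any model that reports a non-zero difference here is a candidate source of the variability, independent of StackingTransformer.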

The script below demonstrates reproducibility of StackingTransformer:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from vecstack import StackingTransformer

# Create data
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, n_redundant=1,
                           n_classes=4, flip_y=0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

# Init models
estimators = [('rf', RandomForestClassifier(random_state=0,
                                            n_jobs=-1,
                                            n_estimators=100,
                                            max_depth=3))]

# Fit and predict 1st time
stack = StackingTransformer(estimators=estimators,
                            regression=False,
                            stratified=True,
                            shuffle=True,
                            random_state=0)
stack = stack.fit(X_train, y_train)
S_train = stack.transform(X_train)
S_test = stack.transform(X_test)

# Fit and predict 2nd time
stack_2 = StackingTransformer(estimators=estimators,
                              regression=False,
                              stratified=True,
                              shuffle=True,
                              random_state=0)
stack_2 = stack_2.fit(X_train, y_train)
S_train_2 = stack_2.transform(X_train)
S_test_2 = stack_2.transform(X_test)

# Compare
print((S_train == S_train_2).all())  # True
print((S_test == S_test_2).all())  # True
```
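When two feature matrices are not exactly equal, it may help to measure how far apart they are rather than only testing `==`. The sketch below is a hypothetical helper (`compare_features` and the `atol` default are my own choices, not part of vecstack) that reports exact equality, closeness within an absolute tolerance, and the largest absolute difference. It is demonstrated on a small hand-made array with a 1e-7 perturbation, the magnitude mentioned in the question.

```python
import numpy as np


def compare_features(a, b, atol=1e-6):
    """Summarize how two feature matrices differ (illustrative helper)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError('shape mismatch: %s vs %s' % (a.shape, b.shape))
    diff = np.abs(a - b)
    return {
        'identical': bool(np.array_equal(a, b)),                 # exact match
        'close': bool(np.allclose(a, b, rtol=0.0, atol=atol)),   # within atol
        'max_abs_diff': float(diff.max()),                       # worst element
        'n_diff': int((diff > 0).sum()),                         # elements that differ
    }


# Made-up stacked features with one element shifted by 1e-7
S_a = np.array([[0.25, 0.75],
                [0.60, 0.40]])
S_b = S_a.copy()
S_b[0, 0] += 1e-7

report = compare_features(S_a, S_b)
print(report)
```

Running the same helper on the output of each base model separately shows which one the difference originates from.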

thank you Igor! this is very helpful