How to map the features at the end of the pipeline back to the initial features

mayawz opened this issue 10 months ago

The input has 581 features, but the feature importances of the final pipeline step cover 587 features.
It looks like the number of features grows as the data moves through the 3 steps of the pipeline: 581 -> 584 -> 587.

Is there a way to map the 587 features at the end of the pipeline back to the original 581 features?

from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier

exported_pipeline = make_pipeline(
    StackingEstimator(estimator=XGBClassifier(learning_rate=0.01, max_depth=4, min_child_weight=6, n_estimators=100, n_jobs=1, subsample=0.15000000000000002, verbosity=0)),
    StackingEstimator(estimator=GaussianNB()),
    XGBClassifier(learning_rate=0.5, max_depth=2, min_child_weight=20, n_estimators=100, n_jobs=1, subsample=0.9000000000000001, verbosity=0)
)

exported_pipeline.fit(x_v, y_v)

trans_x_t = exported_pipeline[0].transform(x_t)
trans_x_t1 = exported_pipeline[1].transform(trans_x_t)

print(x_t.shape)
# (677279, 581)
print(trans_x_t.shape)
# (677279, 584)
print(trans_x_t1.shape)
# (677279, 587)
print(exported_pipeline[-1].feature_importances_.shape)
# (587,)

Pedro Ribeiro · Answer 1 · Tue Nov 07 2023 04:11:25 GMT+0800 (China Standard Time)

The stacking estimator is defined here: https://github.com/EpistasisLab/tpot/blob/master/tpot/builtins/stacking_estimator.py

Effectively, it takes the predictions of the model and prepends them as new columns to the left of the input data X. If the estimator is a classifier with predict_proba, all class probabilities are included as well. For a binary problem, that means two additional probability columns, one per class, on top of the predicted-label column, which accounts for the 3 extra features per stacking step.
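The column layout described above can be illustrated with a small sketch. This is not TPOT's code: `stack_columns` is a hypothetical helper that only mimics the ordering (predicted label, then class probabilities, then the original features) on toy data in plain Python.

```python
def stack_columns(X, predicted_labels, class_probabilities=None):
    """Mimic the stacked layout: [label, probability per class..., original features].

    Hypothetical helper for illustration only; it does not call TPOT.
    """
    rows = []
    for i, features in enumerate(X):
        # Probability columns are only present when the estimator provides them.
        probs = list(class_probabilities[i]) if class_probabilities is not None else []
        rows.append([predicted_labels[i]] + probs + list(features))
    return rows


# Toy data: 2 samples with 2 original features and a binary target.
X = [[5.1, 3.5], [6.2, 2.9]]
labels = [0, 1]
probas = [[0.8, 0.2], [0.3, 0.7]]

stacked = stack_columns(X, labels, probas)
print(stacked[0])  # label first, then both class probabilities, then the 2 features
```

Each row grows by 3 columns here (1 label + 2 probabilities), and the original features keep their order at the right-hand end.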

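So in your case, trans_x_t is [model 1 predicted labels, model 1 probability for class 0, model 1 probability for class 1, <x_t>].

Similarly, trans_x_t1 is [model 2 predicted labels, model 2 probability for class 0, model 2 probability for class 1, <trans_x_t>].

In other words, the first 6 of the 587 columns are synthetic features added by the two stacking steps, and the last 581 columns are the original features in their original order.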
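Given that layout, one way to map the 587 importances back to the original features is to build the list of column names in the same order and then slice off the synthetic columns. The sketch below assumes two stacking steps and a binary target, as in the question. The helper `stacked_feature_names`, the `model{n}_...` labels, and the `feature_{i}` placeholders are made up for illustration, and a uniform list stands in for `exported_pipeline[-1].feature_importances_`.

```python
def stacked_feature_names(original_names, n_stacking_steps, n_classes=2):
    """Return column names after the stacking steps (hypothetical helper).

    Each step prepends one predicted-label column and one probability
    column per class, so the newest columns end up furthest to the left.
    """
    names = list(original_names)
    for step in range(1, n_stacking_steps + 1):
        synthetic = [f"model{step}_pred"]
        synthetic += [f"model{step}_proba_class{c}" for c in range(n_classes)]
        names = synthetic + names
    return names


# Placeholder names; with a DataFrame these could come from its columns.
original_names = [f"feature_{i}" for i in range(581)]
final_names = stacked_feature_names(original_names, n_stacking_steps=2)

# Stand-in values; in the real setting this would be
# exported_pipeline[-1].feature_importances_ (length 587).
importances = [1.0 / len(final_names)] * len(final_names)

# Name every importance, synthetic columns included.
importance_by_name = dict(zip(final_names, importances))

# Keep only the importances that belong to the original 581 features.
n_synthetic = len(final_names) - len(original_names)
original_importances = importances[n_synthetic:]

print(final_names[:7])
print(len(original_importances))
```

The importance assigned to the synthetic columns reflects how much the final model relies on the earlier models' predictions, so it cannot be attributed directly to individual original features.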