EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Home Page:http://epistasislab.github.io/tpot/

How to map the features at the end of the pipeline back to the initial features

mayawz opened this issue · comments

The initial number of features is 581, but the feature importances of the final pipeline cover 587 features.
It looks like the number of features grows as the data moves through the 3-step pipeline: 581 -> 584 -> 587.

Is there a way to map the 587 features at the end of the pipeline back to the original 581 features?

from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier

exported_pipeline = make_pipeline(
    StackingEstimator(estimator=XGBClassifier(learning_rate=0.01, max_depth=4, min_child_weight=6, n_estimators=100, n_jobs=1, subsample=0.15000000000000002, verbosity=0)),
    StackingEstimator(estimator=GaussianNB()),
    XGBClassifier(learning_rate=0.5, max_depth=2, min_child_weight=20, n_estimators=100, n_jobs=1, subsample=0.9000000000000001, verbosity=0)
)

exported_pipeline.fit(x_v, y_v)

trans_x_t = exported_pipeline[0].transform(x_t)
trans_x_t1 = exported_pipeline[1].transform(trans_x_t)

print(x_t.shape)         # (677279, 581)
print(trans_x_t.shape)   # (677279, 584)
print(trans_x_t1.shape)  # (677279, 587)
exported_pipeline[-1].feature_importances_.shape  # (587,)

The stacking estimator is defined here: https://github.com/EpistasisLab/tpot/blob/master/tpot/builtins/stacking_estimator.py

Effectively, it takes the model's predictions and prepends them as new columns to the left of the input data X. If the estimator is a classifier with predict_proba, all of the class probabilities are included as well. For a binary problem, that means three additional columns per stacking step: one for the predicted label and one probability column for each of the two classes.
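As a rough illustration of that column layout, here is a minimal NumPy sketch. It is not TPOT's code: `ToyClassifier` and `stack_transform` are hypothetical names made up for this example, and the "model" is a stand-in that only exists so the shapes can be checked.

```python
import numpy as np

class ToyClassifier:
    """Hypothetical stand-in for a fitted binary classifier (not part of TPOT)."""

    def predict_proba(self, X):
        # Made-up score for class 1, just to produce two probability columns.
        p1 = 1.0 / (1.0 + np.exp(-X.sum(axis=1)))
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)

def stack_transform(estimator, X):
    """Sketch of the stacking step: [predicted label, class probabilities, X]."""
    out = np.copy(X)
    if hasattr(estimator, "predict_proba"):
        # Class probabilities go to the left of the original columns.
        out = np.hstack((estimator.predict_proba(X), out))
    # The predicted label becomes the leftmost column.
    return np.hstack((estimator.predict(X).reshape(-1, 1), out))

X = np.random.default_rng(0).normal(size=(5, 4))
Xt = stack_transform(ToyClassifier(), X)
print(X.shape, Xt.shape)  # (5, 4) (5, 7)
```

The original columns end up untouched on the right-hand side, which is what makes mapping back to the initial features possible.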

So in your case, trans_x_t is [model 1 predicted labels, model 1 probability for class 0, model 1 probability for class 1, <x_t>].

Similarly, trans_x_t1 is [model 2 predicted labels, model 2 probability for class 0, model 2 probability for class 1, <trans_x_t>], so the last 581 columns of trans_x_t1 are the original features in their original order.
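Given that layout, one way to label the 587 columns is to build the names step by step. This is only a sketch: `stacked_feature_names` is a hypothetical helper, the step labels `xgb1` and `gnb2` and the feature names `f0`, `f1`, ... are placeholders, and it assumes a binary problem where every stacked estimator has predict_proba.

```python
def stacked_feature_names(original_names, stacking_steps, n_classes=2):
    """Hypothetical helper: name the columns after a chain of stacking steps.

    Assumes each step prepends [predicted label, P(class 0), ..., P(class n-1)]
    to its input, and that every stacked estimator exposes predict_proba.
    """
    names = list(original_names)
    for step in stacking_steps:
        added = [f"{step}_pred"] + [f"{step}_proba_{c}" for c in range(n_classes)]
        names = added + names  # new columns go on the left
    return names

# Placeholder names for the 581 original features.
original = [f"f{i}" for i in range(581)]
names = stacked_feature_names(original, ["xgb1", "gnb2"])

print(len(names))  # 587
print(names[:6])   # the six stacked columns, most recent step first
print(names[6])    # f0
```

With a fitted pipeline, the names can then be paired with the importances, for example `dict(zip(names, exported_pipeline[-1].feature_importances_))`; under the same assumptions, the last 581 importances belong to the original features.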