Keeping track of feature names after applying onehot_categorizer
vultor33 opened this issue
PATH: .\fklearn\src\fklearn\training\transformation.py : onehot_categorizer
Instructions
When onehot_categorizer is applied, the categorical columns are deleted and new one-hot columns are created.
However, we need the final column names to pass to machine learning models.
These names can be hard-coded, but I think hard-coding is not fkleonic.
Describe the feature and the current state.
A one-hot function can be created with the following code:
from fklearn.training.transformation import onehot_categorizer
onehot_function = onehot_categorizer(columns_to_categorize=['categorical_feature'])
Inside the pipeline, "onehot_function" will be applied to the data and new columns will be created.
I couldn't find any simple way to get the names of the new columns, so I wrote a function to address this. It is enough for me, but I think an official one is needed.
Proposed solution
In the "transformation.py" file, I added the following code:
from typing import Dict, List

def update_onehot_feature_names(features: List[str],
                                log: Dict[str, Dict]) -> List[str]:
    """
    Update feature names that were created by onehot_categorizer

    Parameters
    ----------
    features : list of str
        A list of column names that are used as features for the model. All these names
        should be columns of the training data.

    log : dict
        Log of onehot_categorizer applied to the training data.
        Must be training data, because test data could have categories that aren't previously known.

    Returns
    -------
    new_features : list of str
        A list of column names with the original categorical features removed
        and the new one-hot column names added.
    """
    new_features = list(features)
    for feature_updated in log['onehot_categorizer']['mapping']:
        if feature_updated not in new_features:
            raise Exception(str(feature_updated) + ' not found in features list')
        # Add the nan column name when the log reports hardcode_nans
        if log['onehot_categorizer']['hardcode_nans']:
            new_features += [feature_updated + '==nan']
        new_features.remove(feature_updated)
        for one_hot_column_key in log['onehot_categorizer']['mapping'][feature_updated]:
            # str() keeps the concatenation working for non-string categories
            new_features += [feature_updated + '==' + str(one_hot_column_key)]
    return new_features

setattr(onehot_categorizer, 'update_features', update_onehot_feature_names)
TEST CODE
import random
import pandas as pd
from fklearn.training.transformation import onehot_categorizer
# GENERATE DATA
column1 = [random.choice(['a', 'b', 'c']) for _ in range(100)]
column2 = [random.random() for _ in range(100)]
training_data = pd.DataFrame({'categorical_feature': column1, 'numerical_feature': column2})
FEATURES = training_data.columns.tolist()
print(training_data.head(5))
# NEW FUNCTIONALITY TEST
print('Initial FEATURES: ', FEATURES)
oneHot = onehot_categorizer(columns_to_categorize=['categorical_feature'], store_mapping=True)
_, _, log = oneHot(training_data)  # You can only define extra columns if you 'see' the training_data
NEW_FEATURES = onehot_categorizer.update_features(FEATURES, log)
print('Final FEATURES: ', NEW_FEATURES)
Test expected result
Initial FEATURES: ['categorical_feature', 'numerical_feature']
Final FEATURES: ['numerical_feature', 'categorical_feature==a', 'categorical_feature==b', 'categorical_feature==c']
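The renaming logic can also be checked without running fklearn. The following is only a minimal sketch: the log dictionary is hand-written and assumes nothing beyond the two keys the proposed function reads ('mapping' and 'hardcode_nans'), so the real log returned by onehot_categorizer may contain more fields. The helper name expand_onehot_features is hypothetical and is not part of fklearn.

```python
from typing import Any, Dict, List


def expand_onehot_features(features: List[str],
                           log: Dict[str, Dict[str, Any]]) -> List[str]:
    """Hypothetical standalone copy of the proposed renaming logic."""
    onehot_log = log['onehot_categorizer']
    new_features = list(features)  # copy, so the caller's list is untouched
    for column, categories in onehot_log['mapping'].items():
        if column not in new_features:
            raise ValueError(column + ' not found in features list')
        # Drop the original categorical column name.
        new_features.remove(column)
        # The nan column name is added only when the log reports hardcode_nans.
        if onehot_log['hardcode_nans']:
            new_features.append(column + '==nan')
        # One new name per known category; str() covers non-string categories.
        new_features.extend(column + '==' + str(c) for c in categories)
    return new_features


# Hand-written log: an assumption about its shape, not real fklearn output.
fake_log = {'onehot_categorizer': {
    'mapping': {'categorical_feature': ['a', 'b', 'c']},
    'hardcode_nans': False,
}}

features = ['categorical_feature', 'numerical_feature']
print(expand_onehot_features(features, fake_log))
```

With this hand-written log, the printed list matches the expected final FEATURES shown above. Setting 'hardcode_nans' to True in the log adds 'categorical_feature==nan' ahead of the per-category names.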
Will this change a current behavior? How?
This will not change the current behaviour; it will just simplify the use of onehot_categorizer.
Duplicate of #58