nubank / fklearn

fklearn: Functional Machine Learning

Keeping track of feature names after applying onehot_categorizer

vultor33 opened this issue · comments

PATH: .\fklearn\src\fklearn\training\transformation.py : onehot_categorizer

Instructions

When onehot_categorizer is applied, the categorical columns are dropped and new one-hot columns are created.
However, machine learning models need the final column names.
These names can be hard-coded, but I think hard-coding is not fkleonic.
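
To make the hard-coding problem concrete, here is a minimal sketch in plain Python. It assumes the `column==category` naming that appears in the expected output later in this issue. `derive_onehot_names` is a hypothetical helper used only for illustration and is not part of fklearn.

```python
from typing import List


def derive_onehot_names(column: str, categories: List[str]) -> List[str]:
    """Hypothetical helper: build names following the `column==category` pattern."""
    return [f"{column}=={category}" for category in sorted(set(categories))]


# A list written by hand for the categories known today.
HARD_CODED = [
    "categorical_feature==a",
    "categorical_feature==b",
    "categorical_feature==c",
]

# While the data only contains a, b and c, both approaches agree.
seen_today = ["b", "a", "c", "a"]
assert derive_onehot_names("categorical_feature", seen_today) == HARD_CODED

# Once the training data gains a new category, the hard-coded list is stale.
seen_later = seen_today + ["d"]
derived_later = derive_onehot_names("categorical_feature", seen_later)
missing = [name for name in derived_later if name not in HARD_CODED]
print(missing)  # ['categorical_feature==d']
```

Deriving the names from what the learner actually saw keeps the feature list in sync with the data, which is the motivation for the request.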

Describe the feature and the current state.

The one-hot function can be created with the following code:

from fklearn.training.transformation import onehot_categorizer
onehot_function = onehot_categorizer(columns_to_categorize = ['categorical_feature'])

Inside the pipeline, onehot_function is applied to the data and new columns are created.
I couldn't find a simple way to get the names of the new columns, so I wrote a function to do it. It is enough for me, but I think an official one is needed.

Proposed solution

In the "transformation.py" file, I added the following code:

from typing import Any, Dict, List


def update_onehot_feature_names(features: List[str],
                                log: Dict[str, Dict[str, Any]]) -> List[str]:
    """
    Update the feature names to reflect the columns created by onehot_categorizer.

    Parameters
    ----------
    features : list of str
        A list of column names that are used as features for the model. All these
        names should be columns of the training data.
    log : dict
        Log of onehot_categorizer applied to the training data (with store_mapping=True).
        Must come from training data, because test data could have categories that
        aren't previously known.

    Returns
    -------
    new_features : list of str
        A list of column names with the original categorical features removed
        and the new one-hot column names added.
    """
    onehot_log = log['onehot_categorizer']
    new_features = list(features)
    for feature_updated in onehot_log['mapping']:
        if feature_updated not in new_features:
            raise ValueError(str(feature_updated) + ' not found in features list')

        if onehot_log['hardcode_nans']:
            new_features.append(feature_updated + '==nan')

        new_features.remove(feature_updated)
        for one_hot_column_key in onehot_log['mapping'][feature_updated]:
            # str() keeps this working when the categories are not strings
            new_features.append(feature_updated + '==' + str(one_hot_column_key))
    return new_features


# Expose the helper as an attribute of the learner
setattr(onehot_categorizer, 'update_features', update_onehot_feature_names)
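
The helper can also be exercised without fitting anything, by passing it a hand-written log. The sketch below repeats the logic so that it runs on its own, and converts category keys with `str()` so that integer categories work too. It assumes the log has the shape the function reads (`mapping` and `hardcode_nans` under the `onehot_categorizer` key). The function name `update_features_sketch` and the sample values are made up for illustration.

```python
from typing import Any, Dict, List


def update_features_sketch(features: List[str],
                           log: Dict[str, Dict[str, Any]]) -> List[str]:
    """Standalone copy of the proposed logic, with category keys cast to str."""
    onehot_log = log["onehot_categorizer"]
    new_features = list(features)
    for column, categories in onehot_log["mapping"].items():
        if column not in new_features:
            raise ValueError(f"{column} not found in features list")
        if onehot_log["hardcode_nans"]:
            new_features.append(f"{column}==nan")
        new_features.remove(column)
        new_features.extend(f"{column}=={category}" for category in categories)
    return new_features


# Hand-written log in the shape the function expects (values are illustrative).
sample_log = {
    "onehot_categorizer": {
        "hardcode_nans": True,
        "mapping": {"city": ["paris", "rome"], "floor": [1, 2]},
    }
}

result = update_features_sketch(["city", "floor", "price"], sample_log)
print(result)
# ['price', 'city==nan', 'city==paris', 'city==rome', 'floor==nan', 'floor==1', 'floor==2']
```

Untouched features such as `price` keep their position relative to each other, while each categorical column is replaced by its one-hot names appended at the end of the list.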

TEST CODE

import random
import pandas as pd
from fklearn.training.transformation import onehot_categorizer

# GENERATE DATA
column1 = [random.choice(['a','b','c']) for x in range(100)]
column2 = [random.random() for x in range(100)]
training_data = pd.DataFrame({'categorical_feature' : column1, 'numerical_feature' : column2})
FEATURES = training_data.columns.tolist()
print(training_data.head(5))

# NEW FUNCTIONALITY TEST
print('Initial FEATURES:  ', FEATURES)
one_hot = onehot_categorizer(columns_to_categorize=['categorical_feature'], store_mapping=True)
_, _, log = one_hot(training_data)  # The new columns are only known once the learner has seen the training data
NEW_FEATURES = onehot_categorizer.update_features(FEATURES, log)
print('Final FEATURES:  ', NEW_FEATURES)

Expected test result

Initial FEATURES: ['categorical_feature', 'numerical_feature']
Final FEATURES: ['numerical_feature', 'categorical_feature==a', 'categorical_feature==b', 'categorical_feature==c']

Will this change a current behavior? How?

This will not change the current behaviour; it will only make onehot_categorizer easier to use.
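
As a rough illustration of why attaching the helper is non-intrusive, the sketch below uses a plain Python function as a stand-in for `onehot_categorizer`. In fklearn the learner is a curried function, so this is an analogy under that assumption and not a test of the library itself. All names here are hypothetical.

```python
def learner(values):
    """Stand-in for a learner: returns the sorted distinct values."""
    return sorted(set(values))


def helper(names):
    """Stand-in for the proposed feature-name helper."""
    return [name.upper() for name in names]


before = learner(["b", "a", "b"])

# Attach the helper as an attribute, mirroring the setattr call in the proposal.
setattr(learner, "update_features", helper)

after = learner(["b", "a", "b"])

# Calling the learner gives the same result before and after the attachment.
assert before == after == ["a", "b"]
# The helper is reachable through the learner's namespace.
assert learner.update_features(["x"]) == ["X"]
```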