faster data cleaning
kengz opened this issue
Can we have multi-column versions of two functions commonly used for data cleaning, fillna() and LabelEncoder(), that work directly on the entire data frame X rather than column by column?
MultiFillna(X, str_val='NA', num_val=0) would perform a column-wise fillna() on X using the stated or default values: 'NA' for string columns and 0 for numerical columns. This is especially useful when X has a mix of string and numeric columns and we wish to fill all of them in one go.
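A minimal sketch of how the proposed MultiFillna could behave, written against pandas. The function name multi_fillna, the sample columns, and the use of is_numeric_dtype to tell numeric columns from string columns are assumptions made for illustration, not an existing library API.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def multi_fillna(X, str_val="NA", num_val=0):
    """Fill missing values in every column of X in one call.

    Numeric columns get num_val; every other column gets str_val.
    Returns a new DataFrame and leaves X untouched.
    """
    fill_values = {
        col: num_val if is_numeric_dtype(X[col]) else str_val
        for col in X.columns
    }
    # DataFrame.fillna accepts a dict mapping column name -> fill value.
    return X.fillna(value=fill_values)


# Illustrative data with one string column and one numeric column.
X = pd.DataFrame({
    "sex": ["male", None, "female"],
    "age": [22.0, None, 26.0],
})
X_clean = multi_fillna(X)
```

Building the per-column dict first keeps the fill itself to a single fillna() call, which is the "in one go" behavior requested above.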
MultiLabelEncoder would apply fit_transform to each column named in a given list of headers, and its reverse_transform would apply the inverse. The encoder could be saved with the model at classifier.save(path) and restored for direct use with classifier.restore(path).
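One way the proposed MultiLabelEncoder might look, as a hedged sketch. The class body, the classes_ attribute, and the sample columns are hypothetical. It uses plain sorted label lists rather than any particular library's encoder, and it assumes the listed columns contain no missing values (for instance, after a fill step).

```python
import pandas as pd


class MultiLabelEncoder:
    """Hypothetical sketch: one integer label mapping per named column."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.classes_ = {}  # column name -> sorted list of labels seen in fit

    def fit(self, X):
        for col in self.columns:
            self.classes_[col] = sorted(set(X[col]))
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            codes = {label: i for i, label in enumerate(self.classes_[col])}
            # Raises KeyError for a label that was not seen during fit.
            X[col] = [codes[value] for value in X[col]]
        return X

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def reverse_transform(self, X):
        X = X.copy()
        for col in self.columns:
            labels = self.classes_[col]
            X[col] = [labels[code] for code in X[col]]
        return X


# Illustrative data; only the listed columns are encoded.
X = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "embarked": ["S", "C", "S"],
    "age": [22, 38, 26],
})
encoder = MultiLabelEncoder(["sex", "embarked"])
encoded = encoder.fit_transform(X)
decoded = encoder.reverse_transform(encoded)
```

Because the fitted state is just a dict of label lists, it could in principle be serialized next to the model, which is what the save/restore request above would need.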
For example, with the Titanic data, one could load the model together with its MultiLabelEncoder, take a raw input x=['male', 22, 1, 7.25], and call predict(x), which would internally use the encoder to transform x.
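The prediction flow described above could be sketched as follows. Everything here is illustrative: the column order, the stored label list, and dummy_model (a stand-in that echoes its input instead of a trained classifier) are assumptions, since the issue does not specify them.

```python
# Hypothetical column order for x; the issue does not name the columns.
COLUMNS = ["sex", "age", "sibsp", "fare"]

# Label lists that a fitted encoder would have stored for string columns.
CLASSES = {"sex": ["female", "male"]}


def encode_row(x):
    """Replace string labels in a raw row with their integer codes."""
    encoded = []
    for col, value in zip(COLUMNS, x):
        if col in CLASSES:
            value = CLASSES[col].index(value)
        encoded.append(value)
    return encoded


def predict(x, model):
    """Encode the raw row, then hand it to the model."""
    return model(encode_row(x))


def dummy_model(features):
    # Stand-in for a trained classifier; it simply echoes its input.
    return features


x = ["male", 22, 1, 7.25]
result = predict(x, dummy_model)
```

The caller only ever passes the raw row; the encoding step stays hidden inside predict, so the same label mapping is applied at training and prediction time.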
This is actually possible now with FeatureColumns (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/feature_column.py#L34). Specifically, see sparse_column_with_keys.
Let us know how it works (you can use the issue tracker at https://github.com/tensorflow/tensorflow/).