faster data cleaning

Question

kengz opened this issue 8 years ago · comments

Can we have multi-column versions of two functions commonly used for data cleaning, fillna() and LabelEncoder(), that work directly on the entire data frame X rather than column by column?

MultiFillna(X, str_val='NA', num_val=0) would perform a column-wise fillna() on X using the stated or default values: 'NA' for string columns and 0 for numerical columns. This is especially useful when X has a mix of string and numerical columns and we want to do fillna() in one go.
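A minimal sketch of how such a helper might look with pandas. The name multi_fillna and the sample columns are made up for illustration, and the dtype check via is_numeric_dtype is just one possible way to tell numerical columns from string columns:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def multi_fillna(X, str_val="NA", num_val=0):
    """Return a copy of X with missing values filled per column type.

    Hypothetical helper: numerical columns get num_val, all others get str_val.
    """
    X = X.copy()  # leave the caller's data frame untouched
    for col in X.columns:
        fill = num_val if is_numeric_dtype(X[col]) else str_val
        X[col] = X[col].fillna(fill)
    return X


# Illustrative data with one string column and one numerical column.
df = pd.DataFrame({
    "sex": ["male", None, "female"],
    "age": [22.0, None, 38.0],
})
cleaned = multi_fillna(df)
```

Working on a copy keeps the original frame available for comparison; an in-place variant would be a small change.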

MultiLabelEncoder would apply fit_transform to each column named in a given list of headers, and its reverse_transform would apply the inverse mapping. The encoder could be saved together with the model at classifier.save(path) and restored for direct use with classifier.restore(path).

For example, with the Titanic data, one could run a prediction by loading the model together with its MultiLabelEncoder, passing x=['male', 22, 1, 7.25], and calling predict(x), which would internally use the encoder to transform x.
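A rough sketch of the encoder idea, built on scikit-learn's LabelEncoder. The class below is hypothetical, the column names (sex, age, sibsp, fare) and training rows are assumed for illustration, and pickle stands in for the classifier.save(path) / classifier.restore(path) integration, which is not shown here:

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder


class MultiLabelEncoder:
    """Hypothetical wrapper: one LabelEncoder per named column."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.encoders = {}

    def fit_transform(self, X):
        X = X.copy()
        for col in self.columns:
            enc = LabelEncoder()
            X[col] = enc.fit_transform(X[col])
            self.encoders[col] = enc
        return X

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = self.encoders[col].transform(X[col])
        return X

    def reverse_transform(self, X):
        # Uses LabelEncoder.inverse_transform under the name proposed above.
        X = X.copy()
        for col in self.columns:
            X[col] = self.encoders[col].inverse_transform(X[col])
        return X


# Assumed Titanic-like training rows; only "sex" needs encoding.
train = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22, 38, 26],
    "sibsp": [1, 1, 0],
    "fare": [7.25, 71.28, 7.92],
})
encoder = MultiLabelEncoder(["sex"])
encoded = encoder.fit_transform(train)

# Stand-in for saving and restoring the encoder alongside the model.
restored = pickle.loads(pickle.dumps(encoder))

# Encode a single raw input row before it would be handed to predict().
x = pd.DataFrame([["male", 22, 1, 7.25]], columns=train.columns)
x_encoded = restored.transform(x)
x_decoded = restored.reverse_transform(x_encoded)
```

LabelEncoder assigns integer codes in sorted label order, so the restored encoder maps the same strings to the same integers the model saw during training.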

Illia Polosukhin · Answer 1 · Mon Jun 13 2016 00:34:43 GMT+0800 (China Standard Time)

This is now possible with FeatureColumns (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/feature_column.py#L34). Specifically, see sparse_column_with_keys.

Let us know how it works (you can use the issue tracker at https://github.com/tensorflow/tensorflow/).