tensorflow / data-validation

Library for exploring and validating machine learning data

Infer multivalent features with tfdv from pandas dataframe does not work

pppmlt opened this issue

I want to infer a schema with TensorFlow Data Validation (tfdv) from a pandas dataframe of the training data. The dataframe contains a column with a multivalent feature, i.e., a single row can hold several values of the feature at once, or None. The goal is to use the inferred schema and its domain to find anomalies in serving data (e.g., a previously unseen value of the multivalent feature).

Given the following dataframe:

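import pandas as pd

# feat_2 is the multivalent column: comma-separated values in one string, or None
df = pd.DataFrame([{'feat_1': 13, 'feat_2': 'AA, BB', 'feat_3': 'X'},
                   {'feat_1': 7, 'feat_2': 'AA', 'feat_3': 'Y'},
                   {'feat_1': 7, 'feat_2': None, 'feat_3': None}])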

inferring and displaying the schema results in:

[image: displayed schema]

Thus, tfdv treats the 'feat_2' values as a single string instead of splitting them at the ',' to produce a domain of 'AA', 'BB':

[image: inferred domain of 'feat_2']

If I save the values of the feature as a list instead, e.g., ['AA', 'BB'], schema inference throws an error:

ArrowTypeError: ("Expected bytes, got a 'list' object", 'Conversion failed for column feat_2 with type object')

Is there any way to achieve this with tfdv?

Hi -- TFDV does not currently support generating statistics from a pandas DataFrame with columns containing multivalent features. Only columns with primitive types are supported. Because schema inference is based on stats, that means we don't support schema inference on such data either. We've updated the help for generate_statistics_from_dataframe to note this limitation. Should we support multivalent features in this function in the future, we will update that documentation accordingly.
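Given that limitation, the original goal (spotting previously unseen values of the multivalent feature in serving data) could be approximated outside TFDV. What follows is only a minimal sketch in plain Python, not TFDV functionality: the helper names split_multivalent, infer_domain and find_unseen are hypothetical, the ',' separator is taken from the question, and the serving values are made up for illustration.

```python
from typing import Iterable, List, Set


def split_multivalent(value: object, sep: str = ",") -> List[str]:
    """Split one cell such as 'AA, BB' into ['AA', 'BB'].

    Anything that is not a string (None, NaN) is treated as "no value".
    """
    if not isinstance(value, str):
        return []
    return [token.strip() for token in value.split(sep) if token.strip()]


def infer_domain(values: Iterable[object], sep: str = ",") -> Set[str]:
    """Collect the distinct individual values seen in the training column."""
    domain: Set[str] = set()
    for value in values:
        domain.update(split_multivalent(value, sep))
    return domain


def find_unseen(values: Iterable[object], domain: Set[str], sep: str = ",") -> Set[str]:
    """Return the individual values that do not appear in the training domain."""
    unseen: Set[str] = set()
    for value in values:
        unseen.update(t for t in split_multivalent(value, sep) if t not in domain)
    return unseen


# Training values copied from the 'feat_2' column in the question.
training_feat_2 = ["AA, BB", "AA", None]
# Hypothetical serving values; 'CC' never occurs in training.
serving_feat_2 = ["BB", "AA, CC", None]

domain = infer_domain(training_feat_2)
unseen = find_unseen(serving_feat_2, domain)

print(sorted(domain))  # ['AA', 'BB']
print(sorted(unseen))  # ['CC']
```

Because the helpers only iterate over cell values, passing a pandas column such as df['feat_2'] in place of the lists should work the same way. This covers only the unseen-value check; it does not reproduce the other statistics or anomaly types that TFDV derives from a schema.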