tensorflow / data-validation

Library for exploring and validating machine learning data


Jensen-Shannon divergence sensitive to number of samples

vinay-hebb opened this issue

Why does the code below report a mismatch in distribution? Is the Jensen-Shannon divergence sensitive to the number of samples?

import numpy as np
import pandas as pd
import tensorflow_data_validation as tfdv

def show_anomalies(train_data, test_data):
    # Compute statistics for both datasets and infer a schema from training.
    train_stats = tfdv.generate_statistics_from_dataframe(train_data)
    test_stats = tfdv.generate_statistics_from_dataframe(test_data)
    schema = tfdv.infer_schema(statistics=train_stats)
    # Enable skew detection for every feature, with a JSD threshold of 0.1.
    for f in train_data.columns:
        tfdv.get_feature(schema, f).skew_comparator.jensen_shannon_divergence.threshold = 0.1
    skew_anomalies = tfdv.validate_statistics(train_stats, schema,
                                              serving_statistics=test_stats)
    tfdv.display_anomalies(skew_anomalies)
    return skew_anomalies

mu, sigma = 0, 0.1  # mean and standard deviation
# 1000 training samples vs. only 10 "serving" samples from the same distribution.
a = pd.DataFrame({'RV': np.random.normal(mu, sigma, 1000)})
b = pd.DataFrame({'RV': np.random.normal(mu, sigma, 10)})
_ = show_anomalies(a, b)

which produces the output:

High approximate Jensen-Shannon divergence between training and serving | The approximate Jensen-Shannon divergence between training and serving is 0.429466 (up to six significant digits), above the threshold 0.1

Hi @vinay-hebb,

Kindly refer to link 1 and link 2 for more information on Jensen-Shannon divergence.
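
In short, yes: the divergence TFDV reports is estimated from histograms of the two samples, so a very small sample gives a noisy, upward-biased estimate even when both samples are drawn from the same distribution. Here is a minimal sketch of that effect using scipy rather than TFDV internals; the estimated_jsd helper and its shared-binning scheme are illustrative assumptions, not TFDV's exact algorithm:

import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
mu, sigma = 0, 0.1

def estimated_jsd(n_train, n_test, bins=10):
    # Draw both samples from the SAME distribution, so the true JSD is 0.
    train = rng.normal(mu, sigma, n_train)
    test = rng.normal(mu, sigma, n_test)
    # Histogram both samples over shared bin edges.
    edges = np.histogram_bin_edges(np.concatenate([train, test]), bins=bins)
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(test, bins=edges)
    # scipy returns the JS *distance*; square it to get the divergence.
    # (scipy normalizes the count vectors internally.)
    return jensenshannon(p, q, base=2) ** 2

for n in [10, 100, 1000, 10000]:
    print(n, round(estimated_jsd(1000, n), 4))

With n = 10 the estimate typically lands well above the 0.1 threshold, while by n = 1000 it falls close to zero, even though the two samples always come from the same Gaussian.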

Also, you can refer to the TFDV example notebook and see if that resolves your evaluation anomaly.
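
As a quick check against your own reproduction (assuming show_anomalies, a, mu, and sigma as defined in the question), enlarging the serving sample to match the training sample should drop the reported divergence, and the anomaly should typically no longer fire at the 0.1 threshold:

# Same distribution as before, but 1000 serving samples instead of 10.
b_large = pd.DataFrame({'RV': np.random.normal(mu, sigma, 1000)})
_ = show_anomalies(a, b_large)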

Closing this due to inactivity. Please take a look at the answers provided above; feel free to reopen and post your comments if you still have queries on this. Thank you!