Jensen-Shannon divergence sensitive to number of samples
vinay-hebb opened this issue
Why does the code below report a mismatch in distribution? Is the Jensen-Shannon divergence sensitive to the number of samples?
import numpy as np
import pandas as pd
import tensorflow_data_validation as tfdv

def show_anomalies(train_data, test_data):
    # Compute statistics for both datasets and infer a schema from training.
    train_stats = tfdv.generate_statistics_from_dataframe(train_data)
    test_stats = tfdv.generate_statistics_from_dataframe(test_data)
    schema = tfdv.infer_schema(statistics=train_stats)
    # Flag any feature whose approximate Jensen-Shannon divergence between
    # training and serving exceeds 0.1.
    for f in train_data.columns:
        tfdv.get_feature(schema, f).skew_comparator.jensen_shannon_divergence.threshold = 0.1
    skew_anomalies = tfdv.validate_statistics(train_stats, schema,
                                              serving_statistics=test_stats)
    tfdv.display_anomalies(skew_anomalies)
    return skew_anomalies

mu, sigma = 0, 0.1  # mean and standard deviation
a = pd.DataFrame({'RV': np.random.normal(mu, sigma, 1000)})  # 1000 training rows
b = pd.DataFrame({'RV': np.random.normal(mu, sigma, 10)})    # only 10 serving rows
_ = show_anomalies(a, b)
This produces the following output:
High approximate Jensen-Shannon divergence between training and serving | The approximate Jensen-Shannon divergence between training and serving is 0.429466 (up to six significant digits), above the threshold 0.1
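The reported divergence is estimated from histograms of the two samples, so with only 10 serving rows the empirical histogram is very noisy and the estimate is inflated even though both samples come from the same distribution. Below is a minimal sketch of this effect using numpy/scipy histograms; this is an assumption about the mechanism, not TFDV's exact internal bucketing:

import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
mu, sigma = 0, 0.1

def empirical_jsd(n_train, n_serving, bins=10):
    """Histogram two same-distribution samples on shared bin edges
    and return the empirical Jensen-Shannon divergence."""
    train = rng.normal(mu, sigma, n_train)
    serving = rng.normal(mu, sigma, n_serving)
    edges = np.histogram_bin_edges(np.concatenate([train, serving]), bins=bins)
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(serving, bins=edges)
    # scipy's jensenshannon returns the JS *distance* (and normalizes the
    # counts internally); square it to get the divergence.
    return jensenshannon(p, q, base=2) ** 2

for n in (10, 100, 1000, 100_000):
    print(n, round(empirical_jsd(1000, n), 4))

The estimate is largest for the 10-row sample and shrinks as the serving sample grows, even though the true divergence between the two underlying distributions is zero.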
Hi @vinay-hebb,
Kindly refer to link 1 and link 2 for more info on Jensen-Shannon divergence.
Also, you can refer to the TFDV example notebook and see if it resolves your evaluation anomaly.
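As a quick check of the sample-size hypothesis (a hypothetical follow-up reusing the show_anomalies function from this issue), comparing two same-distribution samples of comparable size should bring the approximate Jensen-Shannon divergence back below the 0.1 threshold:

b_large = pd.DataFrame({'RV': np.random.normal(mu, sigma, 1000)})  # serving sample as large as training
_ = show_anomalies(a, b_large)  # expected: no skew anomaly for 'RV'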
Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have questions. Thank you!