tensorflow / data-validation

Library for exploring and validating machine learning data

Sort by "Distribution Distance" and "Non-uniformity"

tomlusz opened this issue

Hi,

When viewing TFDV statistics in Facets, there are two very valuable "sort by" options (especially when comparing two datasets): "Distribution Distance" and "Non-uniformity". However, I cannot find any description of how they work or which measures and algorithms they are based on.

I am aware of the skew and drift anomalies, and I can calculate them to compare two datasets. When I sort features by those results, the ordering is generally quite similar to "Distribution Distance" (features with a higher L-infinity or Jensen-Shannon value also tend to rank higher when sorted by Distribution Distance), but it is not the same. The "Distribution Distance" sort seems cleverer: it really does surface the most important differences and orders them in a useful way. Does anyone know how it is calculated? I assume it must be calculated in Facets, but how?
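For context, the two measures named above can be computed by hand from per-value counts. The following is a minimal sketch, assuming each dataset's feature is summarized as a dictionary of value counts (for a numeric feature, the keys would be histogram buckets). The function names and the sample counts are hypothetical, and this is not a description of how Facets computes its "Distribution Distance" sort order, which is exactly what this issue is asking about.

```python
import math


def normalize(counts):
    """Turn raw value counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {key: value / total for key, value in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute difference in probability over all values."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (base-2 log), between 0 and 1."""
    keys = set(p) | set(q)
    # Mixture distribution: the average of p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        # Terms with zero probability contribute nothing.
        return sum(
            a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical value counts for one categorical feature in two datasets.
train = normalize({"a": 50, "b": 30, "c": 20})
serving = normalize({"a": 20, "b": 30, "c": 50})

print("L-infinity:", l_infinity_distance(train, serving))
print("Jensen-Shannon:", jensen_shannon_divergence(train, serving))
```

Computing both values per feature and sorting by them gives an ordering that can be compared against the one Facets shows.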

I have a similar question about sorting by "Non-uniformity": how does Facets calculate it? I cannot find any measure of "uniformity" in the statistics or schema results. Facets must calculate it from the statistics, but how exactly? Does anyone know?

I cannot find this information in this repo, even in the code, because, as I wrote, it must be calculated by Facets. Is there a separate repo for Facets where I might find the description?

My goal is to calculate the measures used in the "Distribution Distance" and "Non-uniformity" sorts and save them for further analysis and comparison.

Thanks.

Closing this issue in light of the comment above. Please reopen if this does not address your question. Thanks!