Sort by "Distribution Distance" and "Non-uniformity"

Question

tomlusz opened this issue 3 years ago · comments

Hi,

When I view TFDV statistics in Facets, there are two very valuable "sort by" options (especially when I compare two datasets): "Distribution Distance" and "Non-uniformity". However, I cannot find a description anywhere of how they work or which measures and algorithms they are based on.

I am aware that there are skew and drift anomalies, and I can calculate them to compare two datasets. When I sort the results of these anomalies, the ordering is generally quite similar to "Distribution Distance" (features with a higher L-infinity or Jensen-Shannon value tend to rank higher when sorting by Distribution Distance), but it is not the same. The "sort by Distribution Distance" option seems cleverer: it really does show the most important differences and orders them in a useful way. Does anyone know how it is calculated? I assume it must be calculated in Facets, but how?

There is a similar issue with sort by "Non-uniformity": how does Facets calculate it? I cannot find any measure of "uniformity" in the statistics or schema results. Facets must calculate it from the statistics, but how exactly? Does anyone know?

I cannot find this information in this repo, not even in the code, because, as I wrote, it must be calculated by Facets. Is there a separate repo for Facets where I might find a description?

My goal is to calculate the measures used in sort by "Distribution Distance" and "Non-uniformity" and save them for further analysis and comparison.

Thanks.

caveness · Answer 1 · Tue Aug 31 2021 00:30:10 GMT+0800 (China Standard Time)

Hi – Yes, there is a separate repo for Facets. I believe the information you’re looking for is available here:

Non-uniformity:
https://github.com/PAIR-code/facets/blob/4742b8b93c2dacf22fc8ace2cee42dd06382c48e/facets_overview/components/facets_overview/facets-overview.ts#L195
https://github.com/PAIR-code/facets/blob/4742b8b93c2dacf22fc8ace2cee42dd06382c48e/facets_overview/common/utils.ts#L123
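
For saving a comparable number outside the UI, here is a minimal Python sketch of one common way to score how non-uniform a histogram is: one minus the normalized Shannon entropy of the bucket counts. This is an assumption for illustration only. The function names `normalized_entropy` and `non_uniformity` are hypothetical, and the linked Facets source should be checked before treating this as the measure Facets actually sorts by.

```python
import math
from typing import Sequence


def normalized_entropy(counts: Sequence[float]) -> float:
    """Shannon entropy of the bucket counts divided by its maximum, log(n).

    Returns 1.0 for a perfectly uniform histogram and 0.0 when all mass
    sits in a single bucket. Hypothetical helper, not a Facets or TFDV API.
    """
    total = sum(counts)
    if not counts or total <= 0:
        raise ValueError("counts must be non-empty with a positive sum")
    if len(counts) == 1:
        # A single bucket is treated as trivially uniform.
        return 1.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy / math.log(len(counts))


def non_uniformity(counts: Sequence[float]) -> float:
    """Score in [0, 1]; higher means further from a uniform histogram."""
    return 1.0 - normalized_entropy(counts)


if __name__ == "__main__":
    print(non_uniformity([5, 5, 5, 5]))   # evenly filled buckets
    print(non_uniformity([10, 0, 0, 0]))  # all mass in one bucket
```

The bucket counts would come from the histogram buckets in the TFDV statistics for each feature; how those are extracted is left out here.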

Distribution distance:
https://github.com/PAIR-code/facets/blob/4742b8b93c2dacf22fc8ace2cee42dd06382c48e/facets_overview/components/facets_overview/facets-overview.ts#L219
https://github.com/PAIR-code/facets/blob/4742b8b93c2dacf22fc8ace2cee42dd06382c48e/facets_overview/common/utils.ts#L281
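
For comparison with whatever the linked code computes, here is a minimal Python sketch of the two measures mentioned in the question, L-infinity distance and Jensen-Shannon divergence, applied to two histograms that share the same buckets. It does not claim to reproduce the Facets sort order. The function names are hypothetical, and the choice of base-2 logarithm (which bounds the divergence to [0, 1]) is an assumption.

```python
import math
from typing import List, Sequence


def _to_probabilities(counts: Sequence[float]) -> List[float]:
    """Turn raw bucket counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def l_infinity_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute difference between matching bucket probabilities."""
    if len(a) != len(b):
        raise ValueError("histograms must share the same buckets")
    p, q = _to_probabilities(a), _to_probabilities(b)
    return max(abs(x - y) for x, y in zip(p, q))


def jensen_shannon_divergence(a: Sequence[float], b: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    if len(a) != len(b):
        raise ValueError("histograms must share the same buckets")
    p, q = _to_probabilities(a), _to_probabilities(b)
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u: Sequence[float], v: Sequence[float]) -> float:
        # Terms with u_i == 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    train = [80, 20]  # illustrative counts, not real data
    serve = [50, 50]
    print(l_infinity_distance(train, serve))
    print(jensen_shannon_divergence(train, serve))
```

Computing both values per feature and storing them next to the feature name gives a table that can be sorted and compared against the order shown in Facets.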

caveness · Answer 2 · Tue Sep 07 2021 23:47:32 GMT+0800 (China Standard Time)

Closing this issue in light of the comment above. Please reopen if this does not address your question. Thanks!