tensorflow / data-validation

Library for exploring and validating machine learning data

Jensen-Shannon implementation

Alpha009 opened this issue

What is taken as input when computing the Jensen-Shannon divergence?
Is it the probabilities for the (numerical) pandas column, or the probability density function of the column?

For example, in this code:

tfdv.get_feature(schema1, 'duration').drift_comparator.jensen_shannon_divergence.threshold = 0.01

What is the duration column converted into before the JS divergence value is computed?

@Alpha009, thanks for bringing this up.
I believe we need the pdf of the 'duration' column before the JS divergence value can be computed.
Let me forward this to @caveness.

Sorry for the delay on this. We use the standard histogram of the feature and calculate the JSD from it as shown here:

// JSD(P||Q) = (D(P||M) + D(Q||M))/2
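For anyone who wants to see the formula above worked through on numbers, here is a minimal pure-Python sketch. It is not TFDV's actual code. It assumes that P and Q are the two histograms normalized to sum to 1, that both use the same bin edges, that M = (P + Q)/2 is the usual mixture distribution, and that logs are taken in base 2 (so the result lies between 0 and 1). The helper names (`shared_edges`, `histogram`, `normalize`) and the sample values are hypothetical; how TFDV chooses its bins is not covered here.

```python
import math


def shared_edges(a, b, num_bins):
    """Equal-width bin edges spanning both samples (an assumed binning scheme)."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if lo == hi:  # avoid zero-width bins when all values are identical
        hi = lo + 1
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins)] + [hi]


def histogram(values, edges):
    """Count values per bin; bins are [left, right) except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # ignore values outside the covered range
        for i in range(len(counts)):
            upper = edges[i + 1]
            if v < upper or (i == last and v == upper):
                counts[i] += 1
                break
    return counts


def normalize(counts):
    """Turn bin counts into probabilities that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    return [c / total for c in counts]


def kl_divergence(p, m):
    """D(P||M) in bits, using the convention 0 * log(0) = 0."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(P||Q) = (D(P||M) + D(Q||M)) / 2, with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2


# Hypothetical 'duration' values from two datasets being compared for drift.
baseline = [10, 12, 15, 20, 22, 30]
current = [11, 14, 25, 40, 45, 60]

edges = shared_edges(baseline, current, num_bins=5)
p = normalize(histogram(baseline, edges))
q = normalize(histogram(current, edges))
jsd = jensen_shannon_divergence(p, q)
print(f"JSD = {jsd:.4f}")
```

Under these assumptions the value is 0 when the two histograms are identical and 1 when they have no bin in common, which gives a sense of scale for a threshold such as the 0.01 in the question.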

Please feel free to reopen if more information is needed.