TFDV uses weird float value for sample_count of generated histograms

Question

liwii opened this issue 3 years ago

When I generate statistics from a .tfrecord file with generate_statistics_from_tfrecord, the resulting histograms report non-integer float values as the sample_count of their buckets.
For example, a bucket that should contain 10 samples reports sample_count: 9.94000000834465 instead. How can I get the exact integer sample_count for each bucket?

Here's a Colab to reproduce.

Kenny Song · Answer 1 · Fri Sep 24 2021 11:26:03 GMT+0800 (China Standard Time)

Is there any update (or explanation) for this behavior?

Paul Suganthan · Answer 2 · Sat Sep 25 2021 03:08:17 GMT+0800 (China Standard Time)

TFDV currently uses an approximate method to determine the bucket boundaries in a single pass, which is why the sample counts come out as floats. One option is to post-process the statistics and round the values.
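
A minimal sketch of that rounding step, not an official TFDV utility. The helper name round_bucket_counts is hypothetical. It assumes the statistics object follows the layout datasets -> features -> num_stats -> histograms -> buckets, which is how I understand the tensorflow_metadata statistics proto to be structured. The demo uses plain SimpleNamespace stand-ins instead of real protos so it runs without TFDV installed.

```python
from types import SimpleNamespace


def round_bucket_counts(stats):
    """Round each histogram bucket's sample_count to the nearest integer, in place.

    Assumes `stats` is shaped like a DatasetFeatureStatisticsList:
    stats.datasets[*].features[*].num_stats.histograms[*].buckets[*].sample_count
    """
    for dataset in stats.datasets:
        for feature in dataset.features:
            # Stand-in objects for non-numeric features may lack num_stats.
            num_stats = getattr(feature, "num_stats", None)
            if num_stats is None:
                continue
            for histogram in num_stats.histograms:
                for bucket in histogram.buckets:
                    # Kept as a float because sample_count is a floating-point field.
                    bucket.sample_count = float(round(bucket.sample_count))
    return stats


# Demo with stand-in objects (NOT real TFDV protos).
bucket = SimpleNamespace(low_value=0.0, high_value=1.0,
                         sample_count=9.94000000834465)
histogram = SimpleNamespace(buckets=[bucket])
feature = SimpleNamespace(num_stats=SimpleNamespace(histograms=[histogram]))
stats = SimpleNamespace(datasets=[SimpleNamespace(features=[feature])])

round_bucket_counts(stats)
print(bucket.sample_count)  # 10.0
```

With TFDV, the same function should accept the value returned by generate_statistics_from_tfrecord, provided the field names above match your installed version. Because the bucket boundaries are themselves approximate, the rounded values are tidier to read but are not guaranteed to equal the true per-bucket counts.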

Kenny Song · Answer 3 · Sat Sep 25 2021 13:14:16 GMT+0800 (China Standard Time)

Got it, thanks for the explanation. Are there any error bounds on the approximate counts (e.g., is each count within ±1 of the true count)?