apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page: https://datasketches.apache.org

Sketches for Histogram and NDV

zhuwenzhuang opened this issue · comments

Hi, I found that HIVE-26221 uses the KLL float sketch as its histogram implementation. It works for range-predicate selectivity, but the float sketch cannot calculate the NDV (number of distinct values) within an arbitrary rank range.
Do we support any sketches that can calculate the NDV of an arbitrary range?

I don't think so

Sketches are great for answering specific questions, but their size and speed benefits generally come at the cost of less flexibility to answer other questions.

KLL relies only on a comparator between elements. There's no consideration given to duplicates. The short answer is what @AlexanderSaydakov mentioned above.

One possible idea -- with a huge caveat that we can't say much about error bounds -- would be to have a tuple sketch containing the raw values in the tuple summary. You could query the KLL sketch to get the values associated with the rank range boundaries and then filter the tuple summaries to those within the range. The number of retained values within the range divided by theta would be an estimate of the distinct values within the range. But, again, any error bounds produced by the sketch would be misleading, since we don't know what they'd be when using an approximation of an approximation.
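To make the "retained values divided by theta" step concrete, here is a minimal sketch in plain Python. It is not the DataSketches tuple sketch API: `unit_hash`, `theta_sample`, and `estimate_ndv_in_range` are hypothetical helpers, theta is fixed up front rather than adapting to a nominal size k, and the range bounds `lo` and `hi` are passed in directly where a real pipeline would take them from KLL quantile queries.

```python
import hashlib

def unit_hash(value):
    """Map a value deterministically to a float in [0, 1)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and never rounds up to 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def theta_sample(stream, theta):
    """Stand-in for a theta/tuple sketch: keep each distinct value
    whose hash falls below theta (assumption: theta is fixed)."""
    return {v for v in stream if unit_hash(v) < theta}

def estimate_ndv_in_range(retained, theta, lo, hi):
    """Count retained values inside [lo, hi] and scale up by 1/theta."""
    in_range = sum(1 for v in retained if lo <= v <= hi)
    return in_range / theta

# 1000 distinct values, each appearing 10 times.
stream = [i % 1000 for i in range(10_000)]
retained = theta_sample(stream, theta=0.25)

# The true NDV in [200, 599] is 400; the estimate is only approximate.
print(estimate_ndv_in_range(retained, 0.25, 200, 599))
```

With `theta=1.0` every distinct value is retained and the estimate is exact. Smaller values of theta trade memory for variance, and that variance comes on top of whatever error the KLL-derived range bounds already carry.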

If your data is in Hive and you are willing to allow two passes over your data, you could use KLL to establish the histogram boundaries you are interested in on the first pass, and then on the second pass feed an array of HLL sketches, one per histogram range, that would do distinct counts filtered for each range. This is a little clumsy, but it would provide reliable accuracy bounds based on the HLL configuration. It also avoids the kind of approximation-of-approximations issue @jmalkin mentioned.

Of course, the resulting histogram boundaries are also approximations, but at least you would have independent control of the accuracy of the boundaries and the accuracy of the NDV of each of the bins separately :)

Here is another, but very crude, solution. If you just want a very rough idea of what the NDVs are per bin, you could do this:
From the histogram information produced by the KLL sketch, you can compute the fractional density of each bin (its fraction of the total values, including duplicates). Then, with a parallel HLL sketch counting the NDV of the entire stream, you can compute the fraction of the stream that is duplicates. Finally, with the huge assumption that the duplicates are roughly uniformly distributed across the ranks, you can guess-timate the NDV of each bin.

(I put this in not just for its humor value, but this is almost exactly what political pollsters do!)
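The arithmetic behind this rough estimate can be written out in a few lines. This is only an illustration under the uniform-duplicates assumption stated above: `rough_ndv_per_bin` is a hypothetical helper, `bin_counts` stands in for the per-bin counts derived from the KLL histogram, and `total_ndv` stands in for the estimate from the HLL sketch over the whole stream.

```python
def rough_ndv_per_bin(bin_counts, total_ndv):
    """Split the stream-wide NDV across bins in proportion to each
    bin's share of all values (duplicates included)."""
    total = sum(bin_counts)
    return [total_ndv * count / total for count in bin_counts]

# Hypothetical inputs: three bins holding 25%, 50%, and 25% of
# 10,000 values, with about 1,000 distinct values overall.
print(rough_ndv_per_bin([2500, 5000, 2500], 1000))
```

If the duplicates cluster in one bin, that bin's NDV will be overestimated and the others underestimated, and nothing in the two sketches will reveal it.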

Thanks!
I'm trying:
1. (first table scan) percentile_approx to get the bins' bound_array.
2. (second table scan) array_lower_bound_index(bound_array, data_col) to get (bin id, data_col) pairs.
3. (on the second step's result) compute ndv, min, max, cnt for data_col, grouped by bin id.
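Steps 2 and 3 above could look roughly like this in plain Python. It is a sketch under assumptions, not the Hive query: `bin_stats` is a hypothetical helper, `bounds` stands in for the `bound_array` produced by step 1, and `array_lower_bound_index` is assumed to have lower-bound semantics (index of the first bound that is greater than or equal to the value), which is what `bisect_left` gives.

```python
from bisect import bisect_left
from collections import defaultdict

def bin_stats(values, bounds):
    """Assign each value a bin id, then compute ndv, min, max, and cnt
    per bin. `bounds` must be sorted in ascending order."""
    groups = defaultdict(list)
    for v in values:
        # Assumed equivalent of array_lower_bound_index(bound_array, v).
        groups[bisect_left(bounds, v)].append(v)
    return {
        bin_id: {
            "ndv": len(set(vs)),  # exact here; an HLL per bin would approximate it
            "min": min(vs),
            "max": max(vs),
            "cnt": len(vs),
        }
        for bin_id, vs in sorted(groups.items())
    }

values = [1, 1, 2, 5, 5, 5, 7, 9, 9, 12]
bounds = [2, 7]  # hypothetical output of step 1
for bin_id, stats in bin_stats(values, bounds).items():
    print(bin_id, stats)
```

Here a value equal to a bound lands in the lower bin, so bin 0 covers values up to and including 2, bin 1 covers values above 2 and up to 7, and bin 2 covers everything above 7. Empty bins do not appear in the result.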

I think this issue has been addressed. So I'm going to close this issue.
If you have any further questions you can reopen this issue.