apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page: https://datasketches.apache.org

Inserting multiple elements at once

edmondliuTTD opened this issue:

Is it feasible to update quantile sketches (with a focus on KLL) with multiple values at once rather than one at a time, especially if those values are identical?

I've created a PR at #480 to explore this further. Some basic testing seems to suggest that there is an improvement.

@edmondliuTTD
Thank you for your contribution and your interest in our library. It is clear you have studied the KLL code somewhat. However, after studying your changes for just a few minutes, I can see that you are having to touch several very critical areas of the code, which means we will have to study what you are attempting to do quite carefully. I am not discounting your idea out of hand, but I do have some more questions for you:

Sketches are probabilistic state machines and in order to fully understand their behavior over a wide range of possible inputs we have to perform exhaustive testing that stresses the sketch with millions of trials.

  • So my first question is what testing have you done that could begin to establish that your changes do not violate the statistical properties of the KLL algorithm?

  • Have you tested update speed performance, space performance, and merge accuracy and speed performance?

  • It appears that you are not taking advantage of the logarithmic properties of the levels array. This means that your performance improvement may not be much better than externally feeding the sketch from a loop.

  • It appears that some of your changes are a matter of style and have nothing to do with your targeted changes, e.g., moving static imports from the top of the set of imports to the bottom. Or replacing our multiple specific static imports with "*" imports, which is not a good practice. This creates unnecessary review overload for us, as we have to ask, why did you do that?

  • We would like to know more about the specific problem you are trying to solve. Is the inability to update multiple identical items at once a severe impediment? Can you give us some examples? How big is your total data set? What fraction of the items are duplicates? How critical is the update speed performance?

I'm sure we will have more comments and questions as we dig deeper into this.
Cheers

@edmondliuTTD
Again, thank you for your interest in our library.

First, to answer your question above:

"Is it feasible to update quantile sketches (with a focus on KLL) with multiple values at once rather than one at a time, especially if those values are identical?"

Yes, it is feasible, and it is called a "weighted quantiles sketch". If based on the KLL algorithm, it would be called a "KllWeightedSketch". But implementing it correctly does require a deeper understanding of how the KLL (or classic Quantiles) sketches work. Specifically, it needs to be implemented so that the cost of one weighted update is O(log(m)), and not O(m), where m is the number of duplicates to be entered.
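
To illustrate where the O(log(m)) bound can come from, here is a minimal sketch of the idea, not the library's implementation. It assumes the usual KLL convention that an item stored at level h stands for 2^h original inputs, so m duplicates can be represented by placing one copy of the item at each level whose bit is set in the binary form of m. The class and method names (WeightedUpdateSketch, levelsForWeight) are hypothetical and do not exist in datasketches-java. A real weighted update would also have to trigger compaction when a level overflows, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of datasketches-java.
final class WeightedUpdateSketch {

    // Returns the levels that would each receive one copy of an item
    // inserted with the given weight, assuming an item at level h
    // represents 2^h original inputs (an assumption of this sketch).
    // The loop runs once per bit of the weight, i.e. O(log(weight)) times.
    static List<Integer> levelsForWeight(long weight) {
        if (weight <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        List<Integer> levels = new ArrayList<>();
        for (int level = 0; weight != 0; level++, weight >>>= 1) {
            if ((weight & 1L) == 1L) {
                levels.add(level); // bit 'level' is set: one copy goes here
            }
        }
        return levels;
    }
}
```

For example, a weight of 5 (binary 101) maps to levels 0 and 2, so two placements replace five single updates. A loop that calls the sketch's single-item update m times does m units of work instead.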

You are not the first person to request such a sketch, so we just might be able to get around to it in the near future.

Unfortunately, the PR you submitted does not qualify because your cost of updating m items is O(m). So I will be closing your PR.

Nonetheless, please stay in touch, because when we do implement a weighted quantiles sketch, you could help out in validating that it would work for you.

We would be interested to find out if you (or TTD) are using any of our other sketches and we would be grateful for any feedback.

Cheers,

This issue is now closed, as it has been implemented via PRs #480, #487, #488, #497, and #498.