apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page:https://datasketches.apache.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Question related to union performance

Cheappie opened this issue · comments

Hi, I just wanted to ask whether CpcSketch is fastest Sketch for performing union ? And whether is there other way to speed up unions than decreasing logK param ?

`

static class UnionPerfTest {
    public static void main(String[] args) {

        int logK = 11;
        Random r = new Random();

        CpcSketch fllcpc = new CpcSketch(logK);
        IntStream.generate(() -> r.nextInt(10_000_000)).distinct().limit(100_000).forEach(fllcpc::update);

        CpcSketch sllcpc = new CpcSketch(logK);
        IntStream.generate(() -> r.nextInt(10_000_000)).distinct().limit(1_000_000).forEach(sllcpc::update);

        long blackHole = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000; i++) {
            CpcUnion cpcUnion = new CpcUnion(logK);

            cpcUnion.update(sllcpc);
            cpcUnion.update(fllcpc);

            blackHole += cpcUnion.getResult().getEstimate();
        }

        System.out.println("Millis: " + (System.currentTimeMillis() - start));
        System.out.println("bc: " + blackHole);
    }
}

`

@Cheappie
We did a characterization study on this question which resulted in the last plot on the CpcPerformance page on our website. And this shows that our current implementation of CPC is faster than HLL, which is faster than Theta for both Java and C++. In addition C++ is faster than Java, as you would expect.

However, this particular characterization test held the number of sketches to be merged as a constant (32) and varied the total number of uniques equally distributed across the 32 sketches. There are numerous other ways to construct a merge speed test. I would suggest constructing a test that most closely resembles how merges are actually performed in your application.

That being said, your particular test includes the time to create the union in addition to the merges and executes only two merges. To reduce the influence of the union construction I would recommend doing a lot more than just 2 update operations.

In our back-end systems, a single union might be merging thousands to millions of sketches. Also, it is not typical for all sketches to have the same cardinality, which we assumed in our test. Your test used two different cardinalities. In the environments we encounter the sketch cardinalities tend to vary with a power-law distribution where only a few sketches are very large and then millions of sketches have very small cardinalities. Unfortunately, constructing meaningful tests for this kind of distribution is challenging because there are so many variables.

Nonetheless, the reason this is interesting is the Theta union operation has an "early-stop" feature that should speed up the unioning process considerably as the number of sketches in the merge loop gets large. This might change the ordering of the performance ranking. Constructing such a test is something we have been wanting to do but haven't yet gotten around to. :)

@leerho
Thank you for such detailed answer. After playing with sketches, benchmarking a bit, and seeing charts I decided to go with CPC. The reason why I do 2 update op's per union is that I want to exclude dissimilar characteristics before performing search for exact counts.