apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page: https://datasketches.apache.org

Off-by-one error when Serializing/Deserializing KllItemsSketch<Boolean>

ZacBlanco opened this issue

While implementing KLL sketches for Presto, I found that I couldn't test them with boolean values because the sketch appears to be deserialized incorrectly.

Here is a minimal example which inputs the stream [true, false, true, false, true, false, true, false, true, false] in order:

    @Test
    public void testBrokenSerialization()
    {
        Boolean[] items = IntStream.iterate(0, i -> i + 1)
                .limit(10)
                .mapToObj(i -> i % 2 == 0)
                .toArray(Boolean[]::new);
        KllItemsSketch<Boolean> sketch = KllItemsSketch.newHeapInstance(Boolean::compareTo, new ArrayOfBooleansSerDe());
        Arrays.stream(items).forEach(sketch::update);
        byte[] serialized = sketch.toByteArray();
        KllItemsSketch<Boolean> deserialized = KllItemsSketch.wrap(Memory.wrap(serialized), Boolean::compareTo, new ArrayOfBooleansSerDe());
        checkSketchesEqual(items, sketch, deserialized);
    }

    private static <T> void checkSketchesEqual(T[] items, KllItemsSketch<T> expected, KllItemsSketch<T> actual)
    {
        Arrays.stream(items).forEach(item -> assertEquals(actual.getRank(item), expected.getRank(item), 1E-8));
        for (double i = 0; i < 100.0; i++) {
            double rank = i / 100.0;
            assertEquals(actual.getQuantile(rank), expected.getQuantile(rank));
        }
    }

In the above example I only use 10 elements, so I expect them all to be retained. Thus, even after serialization and deserialization of the sketch, we should get equivalent results.

However, this test fails: getRank(false) is reported as 0.8 when it should be 0.5.
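
For reference, the expected value can be checked by hand without the library. The following is a minimal standalone calculation, not library code: the class and method names are made up for illustration, and it assumes getRank uses the inclusive criterion, i.e. the fraction of retained items that compare less than or equal to the query item.

```java
import java.util.Arrays;

class ExpectedRank {
    // Inclusive normalized rank: fraction of items that compare <= the query item.
    // (Assumption: this mirrors the rank criterion used by getRank in the test.)
    static double inclusiveRank(Boolean[] items, Boolean query) {
        long count = Arrays.stream(items)
                .filter(x -> x.compareTo(query) <= 0)
                .count();
        return (double) count / items.length;
    }

    // Same alternating stream as the test: true, false, true, false, ...
    static Boolean[] alternatingStream(int n) {
        Boolean[] items = new Boolean[n];
        for (int i = 0; i < n; i++) {
            items[i] = (i % 2 == 0);
        }
        return items;
    }

    public static void main(String[] args) {
        Boolean[] items = alternatingStream(10);
        System.out.println(inclusiveRank(items, false)); // prints 0.5 (five of ten are false)
        System.out.println(inclusiveRank(items, true));  // prints 1.0 (every item is <= true)
    }
}
```

Under the same assumption, and with all 10 items retained, a rank of 0.8 for false would mean that eight of the ten retained items were read back as false.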

After digging into the code, the root cause seems to be an off-by-one error when calculating the location to read retained entries inside of KllDirectCompactItemsSketch#getTotalItemsArray. Apologies for the small screenshot.

[image: screenshot showing the value of offset in KllDirectCompactItemsSketch#getTotalItemsArray]

The takeaway from the screenshot is that the variable offset, which is passed as the starting point to serDe.deserializeFromMemory, is set to 25 when it should be 26. I checked the serDe implementation logic and it works fine, so I believe the offset calculation here is what causes the retained entries array of booleans to be deserialized incorrectly as:

[true, false, false, false, false, false, false, false, false, true]

whereas it should be

[true, false, true, false, true, false, true, false, true, false]
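
To illustrate how a one-byte shift in the read offset can scramble bit-packed values, here is a small standalone sketch. It does not use the library and does not reproduce its actual serialization layout: the 26-byte zero-filled header, the LSB-first bit order, and the pack/unpack helpers are all assumptions made for the example, so the corrupted output will not match the array above exactly.

```java
import java.util.Arrays;

class PackedOffsetDemo {
    // Hypothetical header size, chosen only to echo the offsets mentioned above.
    static final int HEADER_BYTES = 26;

    // Pack booleans eight per byte, least significant bit first (assumed bit order).
    static byte[] pack(boolean[] items) {
        byte[] out = new byte[(items.length + 7) / 8];
        for (int i = 0; i < items.length; i++) {
            if (items[i]) {
                out[i >>> 3] |= (byte) (1 << (i & 7));
            }
        }
        return out;
    }

    // Read `count` bit-packed booleans starting at byte `offset` of `buf`.
    static boolean[] unpack(byte[] buf, int offset, int count) {
        boolean[] items = new boolean[count];
        for (int i = 0; i < count; i++) {
            items[i] = ((buf[offset + (i >>> 3)] >>> (i & 7)) & 1) != 0;
        }
        return items;
    }

    // Hypothetical layout: a zero-filled header followed by the packed items.
    static byte[] serialize(boolean[] items) {
        byte[] packed = pack(items);
        byte[] buf = new byte[HEADER_BYTES + packed.length];
        System.arraycopy(packed, 0, buf, HEADER_BYTES, packed.length);
        return buf;
    }

    public static void main(String[] args) {
        boolean[] items = new boolean[10];
        for (int i = 0; i < items.length; i++) {
            items[i] = (i % 2 == 0); // true, false, true, false, ...
        }
        byte[] buf = serialize(items);

        // Correct offset: the original stream comes back unchanged.
        System.out.println(Arrays.toString(unpack(buf, HEADER_BYTES, items.length)));
        // One byte too early: the first eight values come from a header byte instead.
        System.out.println(Arrays.toString(unpack(buf, HEADER_BYTES - 1, items.length)));
    }
}
```

The point of the sketch is only that the serializer and deserializer can both be correct while a wrong starting offset still produces a plausible-looking but different array.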



Thank you for submitting this. I am looking into it.

This should have been fixed by #486

I think there's a 5.0.1 patch release being prepared.

The 5.0.1 patch release is now open for vote.

The following VOTE letter has details of this patch release. Anyone can vote.

https://lists.apache.org/thread/2odrgnt1wbsk6pkbfonr64v79lowpd85

Thank you for the swift fix and release cut! I am looking forward to including the new version once published so I can merge prestodb/presto#21568

I will test with the RC and provide a vote once I confirm the new boolean version passes my test suite.

v5.0.1 has been officially released, so now we can close this.