apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page:https://datasketches.apache.org

CompactSketch ArrayIndexOutOfBoundsException

xinyuwan opened this issue

Hi, we are using the Theta Sketch Java library to calculate reach metrics. Following the Java example on the DataSketches website, we use a Union to merge multiple sketches and then get the CompactSketch in binary format.

However, we sometimes hit an exception when getting the CompactSketch from the Union, with the following stack trace:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 137 out of bounds for length 137 
at org.apache.datasketches.theta.CompactSketch.compactCache(CompactSketch.java:97) 
at org.apache.datasketches.theta.UnionImpl.getResult(UnionImpl.java:238) 
at org.apache.datasketches.theta.UnionImpl.getResult(UnionImpl.java:212) 

Can you guys let us know under what circumstances this would happen and what the root cause is?

Thanks,
Bill

Please provide the library version you're using and a code snippet so we can try to reproduce the error.

Unfortunately, this information is not sufficient to give us much clue as to what is going on. Please send us a small java program that reproduces this problem and we will be glad to help you.

Cheers.

Thanks @jmalkin @leerho for the quick response. Let me add more context to the issue:

Problem: we encounter this ArrayIndexOutOfBoundsException non-deterministically. The same input may fail once and succeed later, and I cannot reproduce the error locally when I make individual calls to getCompactSketch().
Library version: org.apache.datasketches:datasketches-java:1.3.0-incubating
Use case: Here is a description on how we are using the sketch:

  1. We are aggregating reach metrics from minute granularity to hourly and then to daily granularity. We do this by inserting UUIDs into an UpdateSketch and serializing its compact form into a Protobuf ByteString.
  2. In the minute-to-hour and hour-to-day aggregations, we deserialize each ByteString back into a Sketch and feed the sketches into a Union.
  3. Once all minutes of one hour (or hours of one day) have been applied to the Union, we call Union.getResult() and serialize it into a Protobuf ByteString again. The error only occurs during the hour-to-day Union.getResult(), and non-deterministically (not sure if this is because the sketches being merged into the Union are larger at this stage). The error rate is about 5% of total requests.
  4. Throughout the aggregation, we use Nominal Entries (K) = 1024 for both the UpdateSketch and the Union.

Here are some code snippets:

  1. We have a SingleEntityUnionAccumulator, which takes a counter key enum and the ByteString of a sketch (a compacted update sketch):
public class SingleEntityUnionAccumulator {

    final SketchOperations sketchOperations;
    final Map<AdImpressionStatsCounter, ReachData> reach;

    public SingleEntityUnionAccumulator(@Nonnull final SketchOperations sketchOperations) {
        super(sketchOperations);
    }

    public void accumulate(
            final AdImpressionStatsCounter counterKey, final ByteString sketchBytes) {
        ReachData reachData = putReachDataIfAbsent(counterKey);
        Sketch sketch = sketchOperations.byteStringToSketch(sketchBytes);
        reachData.getUnion().update(sketch);
    }

    public Optional<SingleEntityReachData> toReachData() {
        // return an empty Optional if there is no reach data to write to BT
        if (MapUtils.isEmpty(this.getReach())) {
            return Optional.empty();
        }

        Map<Integer, ReachDataEntry> dataEntryMap =
                this.getReach().entrySet().stream()
                        .collect(
                                toMap(
                                        e -> e.getKey().getHash(),
                                        e -> e.getValue().toReachDataEntry()));
        return Optional.of(
                SingleEntityReachData.newBuilder()
                        .putAllDataEntryByCounter(dataEntryMap)
                        .setEntityHierarchy(this.getHierarchy())
                        .build());
    }
}

  2. The toReachData() method is where the exception is thrown, specifically in e -> e.getValue().toReachDataEntry(), which calls Union.getResult():

    public ReachDataEntry toReachDataEntry() {
        return ReachDataEntry.newBuilder()
                .setSketch(sketchOperations.sketchToByteString(getCompactSketch()))
                .setSeedValue(seedValue)
                .build();
    }

    public CompactSketch getCompactSketch() {
        return this.union.getResult();
    }

  3. SketchOperations is a helper class that handles all the SerDe of sketches and unions. In this case:
    @Override
    public ByteString sketchToByteString(final Sketch sketch) {
        return ByteString.copyFrom(sketch.compact().toByteArray());
    }

    @Override
    public Sketch byteStringToSketch(final ByteString sketchBytes) {
        return Sketches.wrapSketch(Memory.wrap(sketchBytes.toByteArray()));
    }

I'm not sure whether the ArrayIndexOutOfBoundsException indicates that something is wrong on the memory/heap side, but can you guys let us know if that could be a cause of the failure in Union.getResult()?

So the crucial part was missing from the original message: you are doing this in a massively parallel system, which is still unnamed. I guess it must be Spark. Most probably the failure is caused by multithreaded execution of some parts of this code, which would explain why it is sporadic. The DataSketches library is not thread-safe. If your code is multithreaded, you need to take care of proper locking where needed.
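To illustrate what "proper locking" could look like, here is a minimal sketch that uses only the JDK. The Guarded class and its update/read methods are hypothetical names made up for this example, not part of DataSketches, and a plain HashSet stands in for the Union. The assumption is that every thread touching the shared object goes through the same lock, for reads (such as getResult()) as well as for updates.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical helper: serializes all access to a non-thread-safe object. */
final class Guarded<T> {
    private final T target; // in the real code this would be the theta Union
    private final ReentrantLock lock = new ReentrantLock();

    Guarded(T target) {
        this.target = target;
    }

    /** Mutates the guarded object while holding the lock. */
    void update(Consumer<T> action) {
        lock.lock();
        try {
            action.accept(target);
        } finally {
            lock.unlock();
        }
    }

    /** Reads from the guarded object while holding the same lock. */
    <R> R read(Function<T, R> query) {
        lock.lock();
        try {
            return query.apply(target);
        } finally {
            lock.unlock();
        }
    }

    /** Four writer threads add 1000 distinct values each; returns the final size. */
    static int demo() {
        Guarded<Set<Integer>> guarded = new Guarded<>(new HashSet<>());
        Thread[] writers = new Thread[4];
        for (int t = 0; t < writers.length; t++) {
            final int base = t * 1000;
            writers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    final int value = base + i;
                    guarded.update(set -> set.add(value));
                }
            });
            writers[t].start();
        }
        try {
            for (Thread w : writers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return guarded.read(Set::size);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the accumulator above, the equivalent would be to route both reachData.getUnion().update(sketch) and union.getResult() through one such guard per union, or simply to confine each union to a single thread.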

@leerho @AlexanderSaydakov thanks a lot for pointing us in the right direction. It is indeed a multithreading issue: we have two threads, one calling getCompactSketch() to write the sketch out to external storage and the other calling getCompactSketch() to derive a new sketch, and these two must be racing. After sequencing the two steps in a single thread, the problem is gone. I just want to confirm: if I call union.getResult() multiple times sequentially, that is fine, right?
Basically, if

  1. union.getResult() returns a sketch having an estimate of 10
  2. then I call union.getResult() again, it should return 10 again
  3. and if I then update union with 1 more entry, and call union.getResult(), it should give me 11 right?

Yes, unless some other thread is modifying the union at the same time you are calling getResult().
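The three-step sequence from the question can be checked on a single thread with a small program like the one below. It is only a sketch: the class name RepeatedGetResult is made up for the example, and it assumes the org.apache.datasketches.theta API as found in the 1.3.0-incubating release (SetOperation.builder(), Union.update(long), Union.getResult()). The estimates come out as exact counts here only because the number of distinct entries stays far below K = 1024; above that, the sketch moves into estimation mode and one extra entry need not change the estimate by exactly 1.

```java
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;

/** Hypothetical single-threaded check that getResult() is repeatable. */
final class RepeatedGetResult {

    /** Returns the estimates after 10 entries, again after 10, and after 11. */
    static double[] estimates() {
        Union union = SetOperation.builder().setNominalEntries(1024).buildUnion();

        // Feed 10 distinct values directly into the union.
        for (long i = 0; i < 10; i++) {
            union.update(i);
        }
        double first = union.getResult().getEstimate();

        // Calling getResult() again without any update in between.
        double second = union.getResult().getEstimate();

        // One more distinct value, then a third result.
        union.update(10L);
        double third = union.getResult().getEstimate();

        return new double[] {first, second, third};
    }

    public static void main(String[] args) {
        for (double e : estimates()) {
            System.out.println(e);
        }
    }
}
```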