apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page: https://datasketches.apache.org

Possible improvement for quick isEmpty check on sketches

adarshsanjeev opened this issue

Recently, we found a performance issue while using data sketches through Apache Druid.

There was some slowness while running aggregations which merge HllSketches to get a final count. Flame graphs show that a lot of the time is spent checking whether the sketches are empty.

In that particular data set, there were a large number of empty sketches, and each one was deserialized from a byte array just so that isEmpty() could be called on it. This could be avoided, since merging an empty sketch is a no-op; a way to check for emptiness without first deserializing the entire sketch might help here.

We added a custom check on the byte array that determines whether the sketch is empty without deserializing it (by checking the isEmpty flag in the header for the sketch implementation, and the number-of-elements byte for the set and list implementations), and saw performance improvements: 11 seconds down to 9.7 seconds for 10M empty sketches being merged, with only this change.

Is there some scope for a check like this to be added to the data sketches library?

Are you doing a full deserialize or wrapping a chunk of memory? In at least some cases I believe there's a fast wrap that doesn't do validation of the image before letting you perform operations (like checking if empty). HLL is more complex than theta though, in terms of serialized form, so it may still need some checks to work properly.

I was out all last week, so I am just now reading this. I think Adarsh has a good point. Let me look into this.

@adarshsanjeev,
I'm not familiar with the HllSketchHolderObjectStrategy in Druid.

I think your "isSaveToConvertToNullSketch(ByteBuffer, int)" is too complicated. But I may not be aware of other issues you are trying to deal with, so the following suggestions may or may not be useful to you.

My recommended approach would be to wrap the sketch, which doesn't deserialize the sketch, and then you can use the HllSketch API to check for empty:

//All of our sketches require LittleEndian Byte Order
Memory mem = Memory.wrap(byteBuffer, ByteOrder.LITTLE_ENDIAN); //byteBuffer holds the serialized sketch
HllSketch sk = HllSketch.wrap(mem);
if (sk.isEmpty()) { ... }

Another approach would be examining the binary as you are trying to do. But this can be fragile since the binary format can change.

It doesn't matter whether the sketch has been compacted or not, the following will hold true:

Given the following byte names and indices, starting from 0:

  • Family = index 2 //for HLL sketches this is == 7
  • Flags = index 5
  • ListCount = index 6
  • Mode = index 7

If an HLL sketch is empty, the following will be true:

  • Family == 7
  • (Flags & 0x4) > 0 // an empty sketch should have the empty bit set
  • ListCount == 0 // an empty sketch always has a List Count == 0
  • (Mode & 0x3) == 0 // an empty sketch is always in LIST mode

An empty sketch always has at least 8 bytes. If the size is < 8 bytes, it has been truncated outside our sketch code and is not a valid sketch.
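The conditions above could be checked byte by byte along the following lines. This is only a sketch using plain java.nio: the class and method names are hypothetical, it assumes the sketch image starts at the buffer's current position, and the arrays in main are hand-built headers rather than real serialized sketches.

```java
import java.nio.ByteBuffer;

final class HllEmptyHeaderCheck {
    // Byte indices and constants taken from the lists above.
    static final int FAMILY_BYTE = 2;
    static final int FLAGS_BYTE = 5;
    static final int LIST_COUNT_BYTE = 6;
    static final int MODE_BYTE = 7;
    static final int HLL_FAMILY_ID = 7;
    static final int EMPTY_FLAG_MASK = 0x4;
    static final int MODE_MASK = 0x3;
    static final int MIN_IMAGE_BYTES = 8;

    /** Hypothetical helper: true only if every "empty" condition holds. */
    static boolean looksEmpty(ByteBuffer buf) {
        if (buf.remaining() < MIN_IMAGE_BYTES) {
            return false; // truncated image: cannot be confirmed empty
        }
        int start = buf.position();
        // Absolute gets leave the buffer's position untouched.
        int family = buf.get(start + FAMILY_BYTE) & 0xFF;
        int flags = buf.get(start + FLAGS_BYTE) & 0xFF;
        int listCount = buf.get(start + LIST_COUNT_BYTE) & 0xFF;
        int mode = buf.get(start + MODE_BYTE) & 0xFF;
        return family == HLL_FAMILY_ID
            && (flags & EMPTY_FLAG_MASK) != 0
            && listCount == 0
            && (mode & MODE_MASK) == 0;
    }

    public static void main(String[] args) {
        byte[] emptyLike = {0, 0, 7, 0, 0, 4, 0, 0};    // family 7, empty bit set
        byte[] nonEmptyLike = {0, 0, 7, 0, 0, 0, 1, 0}; // empty bit clear, one list entry
        System.out.println(looksEmpty(ByteBuffer.wrap(emptyLike)));    // true
        System.out.println(looksEmpty(ByteBuffer.wrap(nonEmptyLike))); // false
    }
}
```

Returning false for a short buffer only means "not confirmed empty", so a caller would fall through to its normal deserialization path, which can then reject the invalid image.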

One approach would be to examine the first 8 bytes in one operation:
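long mask = 0x03_FF_04_00_00_07_00_00L;
//first8bytes must be read as a little-endian long, so that byte 0 is the least significant byte
if ((first8bytes & mask) == 0x04_00_00_07_00_00L) { /* empty */ } else { /* not empty */ }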
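One way the single-long comparison might be wired up with a plain ByteBuffer is sketched below. The class and method names are hypothetical, and the sketch assumes the image starts at the buffer's current position. The mask and expected value are the ones given above; the little-endian read is what places byte index 2 (Family) at bits 16 to 23 of the long.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class HllEmptyLongCheck {
    // Mask and expected value from the comment above.
    static final long MASK = 0x03_FF_04_00_00_07_00_00L;
    static final long EMPTY = 0x04_00_00_07_00_00L;

    /** Hypothetical helper: compares the first 8 bytes in one operation. */
    static boolean looksEmpty(ByteBuffer buf) {
        if (buf.remaining() < Long.BYTES) {
            return false; // fewer than 8 bytes: not a valid sketch image
        }
        // duplicate() shares the bytes but has its own byte order, so the
        // caller's buffer keeps whatever order and position it had.
        long first8 = buf.duplicate()
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .getLong(buf.position());
        return (first8 & MASK) == EMPTY;
    }

    public static void main(String[] args) {
        byte[] emptyLike = {0, 0, 7, 0, 0, 4, 0, 0};    // hand-built header, not a real sketch
        byte[] nonEmptyLike = {0, 0, 7, 0, 0, 0, 1, 0};
        System.out.println(looksEmpty(ByteBuffer.wrap(emptyLike)));    // true
        System.out.println(looksEmpty(ByteBuffer.wrap(nonEmptyLike))); // false
    }
}
```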

I hope this helps.

@leerho Thanks for the comments!
For the first approach, going through the code, HllSketch.wrap(mem) has a lot of further checks, but I don't see anything very expensive getting called. I can confirm that and see if the performance is similar with that check.

For the binary checking approach, I believe isSaveToConvertToNullSketch implements something similar (checking the implementation and then checking the bytes that specify the size for each of them separately).
I was not sure if the implementation would always be a list implementation, hence the extra checks for set and sketch implementations as well. I can remove this part of the code if it is guaranteed that it would be a list implementation internally for all empty sketches.

  1. HllSketch.wrap(mem) has more checks because it must handle the cases where the sketch is not empty. And even if it is empty it must create a valid Direct Sketch with a reference to the empty memory object. From my understanding of what your issue is, you just want to check if the sketch is empty and abort from that path as quickly as possible to look at the next sketch (for example).

  2. WRT the binary approach: This simple test assumes:

    • That your ByteBuffer image has the sketch image starting at byte 0 of the BB, i.e., there is nothing in front of the start of the sketch!
    • That there has been no modification to the sketch image contained in the BB.

    If both of these assumptions hold, then the simple test I gave you will always work.

  3. One more possible check: If, and only if, the capacity of the BB is exactly the size of the sketch image, AND the sketch has been serialized into its compact form (i.e., the sketch image was created by sketch.toCompactByteArray()), then all you need to do is check the size of the ByteBuffer:
    if (byteBuffer.capacity() == 8) { /* the sketch is guaranteed to be empty! */ }

This issue can now be closed.