apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page: https://datasketches.apache.org


Vectorized KLL sketch updates

fanyang01 opened this issue · comments

First of all, thanks for the great work on this high-quality library!

We are using the KLL sketch to estimate quantiles in a new columnar database system and it works fantastically well! We have found that the sketch update is one of the most expensive operations in the system for data writing, and are wondering if there are any opportunities for vectorization. In columnar database systems, data comes in batches and is stored in arrays, each of which usually has a few thousand elements. We are wondering if it is possible to vectorize the KLL sketch update operation to take advantage of the array-based data layout.

Taking KllDoubleSketch#update as an example, the update method is implemented as follows:

public void update(final double item) {
  if (Double.isNaN(item)) { return; } //ignore
  if (readOnly) { throw new SketchesArgumentException(TGT_IS_READ_ONLY_MSG); }
  KllDoublesHelper.updateDouble(this, item);
  kllDoublesSV = null;
}

and KllDoublesHelper#updateDouble is implemented as follows:

static void updateDouble(final KllDoublesSketch dblSk, final double item) {
  dblSk.updateMinMax(item);
  int freeSpace = dblSk.levelsArr[0];
  assert (freeSpace >= 0);
  if (freeSpace == 0) {
    compressWhileUpdatingSketch(dblSk);
    freeSpace = dblSk.levelsArr[0];
    assert (freeSpace > 0);
  }
  dblSk.incN();
  dblSk.setLevelZeroSorted(false);
  final int nextPos = freeSpace - 1;
  dblSk.setLevelsArrayAt(0, nextPos);
  dblSk.setDoubleItemsArrayAt(nextPos, item);
}
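
For reference, this is roughly how we drive the sketch from our write path today: one update(double) call per element of each batch (a simplified illustration, not our actual code):

import org.apache.datasketches.kll.KllDoublesSketch;

final class PerElementIngest {
  // Current approach: feed each batch to the column's sketch one element at a time.
  static void ingest(final KllDoublesSketch colSketch, final double[] values,
                     final int offset, final int length) {
    for (int i = 0; i < length; i++) {
      colSketch.update(values[offset + i]); // one call + NaN check per element
    }
  }
}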

Is it possible to add a vectorized version of update that takes an array of values and vectorizes the operations as much as possible? For example, could the update method be implemented as follows

  @Override
  public void update(final double[] items, final int offset, final int length) {
    if (readOnly) { throw new SketchesArgumentException(TGT_IS_READ_ONLY_MSG); }
    boolean hasNaN = false;
    boolean allNaNs = true;
    for (int i = 0; i < length; i++) {
      final boolean isNaN = Double.isNaN(items[offset + i]);
      hasNaN |= isNaN;
      allNaNs &= isNaN;
    }

    if (allNaNs) { return; }
    else if (!hasNaN) { // fast path
      KllDoublesHelper.updateDouble(this, items, offset, length);
    }
    else {
      for (int i = 0; i < length; i++) {
        final double v = items[offset + i];
        if (!Double.isNaN(v)) {
          KllDoublesHelper.updateDouble(this, v);
        }
      }
    }
    kllDoublesSV = null;
  }

and KllDoublesHelper#updateDouble be implemented as follows

  static void updateDouble(final KllDoublesSketch dblSk, final double[] items,
                           final int offset, final int length) {
    dblSk.updateMinMax(items, offset, length); // Vectorized min/max update, trivial
    int count = 0;
    while (count < length) {
      if (dblSk.levelsArr[0] == 0) {
        compressWhileUpdatingSketch(dblSk);
      }
      final int spaceNeeded = length - count;
      final int freeSpace = dblSk.levelsArr[0];
      assert (freeSpace > 0);
      final int numItemsToCopy = Math.min(spaceNeeded, freeSpace);
      final int dstOffset = freeSpace - numItemsToCopy;
      System.arraycopy(items, offset + count, dblSk.doubleItems, dstOffset, numItemsToCopy); // For KllHeapDoublesSketch
      count += numItemsToCopy;
      dblSk.incN(numItemsToCopy);
      dblSk.setLevelZeroSorted(false);
      dblSk.setLevelsArrayAt(0, dstOffset);
    }
  }

?
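
A hypothetical call site in a columnar write path could then consume the batch in one call (the class and method names below are made up; the overload itself is the proposal above):

import org.apache.datasketches.kll.KllDoublesSketch;

final class ColumnStatsCollector {
  private final KllDoublesSketch colSketch = KllDoublesSketch.newHeapInstance();

  // Called once per incoming batch of this column; rowCount <= columnBatch.length.
  void onBatch(final double[] columnBatch, final int rowCount) {
    // One bulk call per batch instead of rowCount calls to update(double).
    colSketch.update(columnBatch, 0, rowCount); // proposed overload, not in the released API
  }
}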

If this is valid, I would be happy to contribute to the implementation.

Cheers,
Fan Yang

This is an interesting idea, but it's not yet obvious to me that it would provide a major speedup. As currently written, it iterates through the input array at least three times: once to compute the NaN flags, once for the min/max updates, and again for the chunked copy. And in the case where there are NaNs, it does that check twice while still using the original route (so it would be a little slower than the existing method).

This is a case where some benchmarking experiments would be very helpful.

Thank you for your suggestions and willingness to contribute! I also really appreciate the feedback on how our sketches are working (or not working!) in your systems. Your suggestion is written very clearly, and it is easy to understand what you are trying to do. Thank you!

I have some questions about your proposed implementation. As Jon just mentioned, your vectorized input proposal has potentially three loops: the first to detect NaNs, the second to possibly use a faster update path if no NaNs are detected in the input array, and the third used only if there is a NaN in the array.

Is there a reason that you feel this is faster than doing just one loop? I.e., loop on the input array, and if there is a NaN, skip it and go to the next. If it is non-NaN, just use the regular internal updateDouble(dblSk, item). Done!
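
To make that concrete, here is a minimal sketch of the one-loop variant I have in mind, written as if it were a member of KllDoublesSketch so it can reach the internal helper:

// One-loop bulk update: skip NaNs, otherwise reuse the existing single-item path.
public void update(final double[] items, final int offset, final int length) {
  if (readOnly) { throw new SketchesArgumentException(TGT_IS_READ_ONLY_MSG); }
  for (int i = offset; i < offset + length; i++) {
    final double v = items[i];
    if (Double.isNaN(v)) { continue; } // ignore, just like update(double)
    KllDoublesHelper.updateDouble(this, v);
  }
  kllDoublesSV = null;
}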

There are at least three probabilistic factors in play here.

The first factor is the available freeSpace in the internal itemsArray, tracked by levelsArr[0]. This can be quite random and quickly becomes smaller and smaller as the length of your stream increases (this is a critical part of the KLL algorithm). Independent of how large your factor K is, as the stream gets longer and longer the free space will reduce down to at most 8 items immediately after compressWhileUpdatingSketch(dblSk) is called. Eight items is not a lot of free space!

The second factor is the distribution of vector lengths coming from your database. Large databases tend to have long column lengths as well (often in the many millions). This means that no matter how much optimization you try to do with your vector bulk update method, the update time will be dominated by the compression step. And, given lots of long vectors, the compression may end up being called multiple times for a single vector call.

The third is the density of NaNs in your data and, again, only you know what that is.

Again, as Jon suggests, what would be really helpful here would be for you to do a characterization study.

For example, in your own fork, modify the code as you have proposed and then characterize the update speed using typical data from your database. Then compare it with either an external loop outside the sketch or with just the simple one-loop public vector update method I mentioned above. Because of all the probabilistic factors, you will need to run lots of trials using randomized sequences of your data in order to get meaningful results.
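
As a rough illustration of what I mean by lots of trials over randomized data, something like the following plain timing harness would do (not a polished benchmark; it assumes your proposed update(double[], int, int) overload exists in your fork):

import java.util.Random;
import org.apache.datasketches.kll.KllDoublesSketch;

public final class BulkUpdateTrials {

  public static void main(final String[] args) {
    final int trials = 20;
    final int batches = 2_500;
    final int batchLen = 4_096; // a typical vector length in a vectorized engine
    final Random rnd = new Random(17);

    long bulkNanos = 0;
    long loopNanos = 0;
    for (int t = 0; t < trials; t++) {
      // Fresh randomized data for every trial.
      final double[][] data = new double[batches][batchLen];
      for (final double[] batch : data) {
        for (int i = 0; i < batchLen; i++) { batch[i] = rnd.nextDouble(); }
      }

      final KllDoublesSketch bulkSk = KllDoublesSketch.newHeapInstance();
      long start = System.nanoTime();
      for (final double[] batch : data) { bulkSk.update(batch, 0, batchLen); } // proposed overload
      bulkNanos += System.nanoTime() - start;

      final KllDoublesSketch loopSk = KllDoublesSketch.newHeapInstance();
      start = System.nanoTime();
      for (final double[] batch : data) {
        for (final double v : batch) { loopSk.update(v); }
      }
      loopNanos += System.nanoTime() - start;
    }
    System.out.println("bulk: " + (bulkNanos / 1e6) + " ms, per-item: " + (loopNanos / 1e6) + " ms");
  }
}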

While doing this it would also be useful for you to characterize the distribution of vector lengths (using another KLL sketch), the distribution of stream lengths (if different, also using KLL), and the density of NaNs in your data (using simple counters). Correlating these results with the update speed test should give you much deeper insight into your data, and possibly where to focus our attention on improving our KLL implementation. :)
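
The characterization itself only needs a couple of extra sketches and counters along your write path; a rough sketch (the class and method names here are just placeholders):

import org.apache.datasketches.kll.KllDoublesSketch;

final class IngestCharacterizer {
  private final KllDoublesSketch vectorLengths = KllDoublesSketch.newHeapInstance();
  private long nanCount = 0;
  private long itemCount = 0;

  // Call once per incoming vector/batch.
  void observeBatch(final double[] items, final int offset, final int length) {
    vectorLengths.update(length); // distribution of vector lengths
    itemCount += length;
    for (int i = 0; i < length; i++) {
      if (Double.isNaN(items[offset + i])) { nanCount++; }
    }
  }

  void report() {
    System.out.println("median vector length: " + vectorLengths.getQuantile(0.5));
    System.out.println("p99 vector length:    " + vectorLengths.getQuantile(0.99));
    System.out.println("NaN density:          " + ((double) nanCount / itemCount));
  }
}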

And most important, keep in touch and let us know your progress!

Cheers,
Lee.

A couple more questions.

I note that your current use case leverages the KllDoublesSketch. Do you have applications for the generic KllItemsSketch or KllFloatsSketch as well?

What kind of quantile (or rank) queries do you do? For example, are you most interested in the extremes, like the 99.9th percentile, or do you query the whole distribution of ranks, for example the (10, 20, ..., 90) percentiles, to understand the full shape of the distribution?

For example, if you are primarily interested in the extremes, you might be interested in the REQ sketch.

Thanks for your insightful and informative feedback! I will do some benchmarking experiments and get back to you. Below are some answers to your questions. Let's start with the use case.

Do you have applications for the generic KllItemsSketch or KllFloatsSketch as well? What kind of quantile (or rank) queries do you do?

We use all three sketches in our database system. The SQL query optimizer needs to estimate the number of rows that satisfy a given predicate, for example, col >= 10 and col <= 20. We maintain a KLL sketch for each column and use the sketch to estimate the number of rows that satisfy the predicate. For example, if the estimated rank of 10 is 10% and the estimated rank of 20 is 20%, then the estimated number of rows that satisfy the predicate is (20% - 10%) * total number of rows. Therefore, KLL sketches are exactly what we need, as they can be roughly viewed as equi-depth histograms and provide PAC accuracy guarantees for the whole range of quantiles. Given that there are many data types in a database, we choose the most appropriate sketch type for each column. For example, we use KllFloatsSketch and KllDoublesSketch for numeric columns, and KllItemsSketch for string/binary columns. We also investigated the REQ sketch, but its accurate extreme estimation is not what we need currently.
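
To illustrate, the selectivity estimate for a predicate like col >= 10 and col <= 20 falls straight out of two rank queries (a simplified illustration; rowCount is just the table's total row count from the catalog):

import org.apache.datasketches.kll.KllDoublesSketch;

final class RangeSelectivity {
  // Estimated number of rows with lo <= col <= hi, given the column's KLL sketch.
  static double estimateRows(final KllDoublesSketch colSketch, final double lo,
                             final double hi, final long rowCount) {
    final double rankLo = colSketch.getRank(lo); // normalized rank in [0, 1]
    final double rankHi = colSketch.getRank(hi);
    return Math.max(0.0, rankHi - rankLo) * rowCount;
  }
}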

We also use the sketches to do approximate sorting, which is an exciting application. In database systems, full sorting is expensive. Sometimes, approximate sorting is enough or could be used to speed up the full sorting ( https://dl.acm.org/doi/pdf/10.1145/3318464.3389752 ). In one of our current explorations, we use the KLL sketch to estimate a sequence of equispaced quantiles of a column, use the quantiles to partition the rows into buckets, and then process each bucket separately. According to our experiments, this approach is much faster than full sorting. The additive accuracy guarantee of KLL sketches is exactly what we need for this application, as the partitions are expected to be balanced.
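
A stripped-down illustration of the bucketing step (not our actual code; the real pipeline distributes the buckets across workers):

import org.apache.datasketches.kll.KllDoublesSketch;

final class ApproxPartitioner {
  // numBuckets - 1 split points at equispaced ranks, e.g. ranks 0.25/0.5/0.75 for 4 buckets.
  static double[] splitPoints(final KllDoublesSketch colSketch, final int numBuckets) {
    final double[] splits = new double[numBuckets - 1];
    for (int i = 1; i < numBuckets; i++) {
      splits[i - 1] = colSketch.getQuantile((double) i / numBuckets);
    }
    return splits;
  }

  // Bucket index of a value: the first split point that is >= v (binary search in practice).
  static int bucketOf(final double v, final double[] splits) {
    int b = 0;
    while (b < splits.length && v > splits[b]) { b++; }
    return b;
  }
}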

the distribution of vector lengths, the distribution of stream lengths, and the density of NaNs in your data.

Modern columnar databases usually implement a MonetDB/X100-like vectorized execution engine that processes data in batches. The batch size (i.e., vector length) is usually 1K to 10K and has not changed much over the years (see Figure 10 in the paper). In our system, the data of each table is stored in many files, each of which is limited to 10K to 1M rows. We collect sketches for each file and then merge them to get the sketches for the whole table. Therefore, the stream length is usually in the range of 10K to 1M. Since the original data are usually integers or decimal numbers, NaNs are rare.
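
Concretely, the per-file sketches for a column are combined with the library's merge operation, roughly like this (simplified; per-type dispatch and error handling omitted):

import java.util.List;
import org.apache.datasketches.kll.KllDoublesSketch;

final class TableStats {
  // Combine the per-file sketches of one column into a single table-level sketch.
  static KllDoublesSketch mergeFileSketches(final List<KllDoublesSketch> fileSketches) {
    final KllDoublesSketch tableSketch = KllDoublesSketch.newHeapInstance();
    for (final KllDoublesSketch fileSketch : fileSketches) {
      tableSketch.merge(fileSketch);
    }
    return tableSketch;
  }
}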

Independent of how large your factor K is, as the stream gets longer and longer it will reduce down to at most 8 items immediately after compressWhileUpdatingSketch(dblSk) is called. 8 items is not a lot of free space!

This is an important message. Thanks for pointing it out!

As currently written, it's iterating through the input array at least 3 times -- once to compute the NaN flags, once for the min/max updates, and then again for the chunked copy.
Is there a reason that you feel this is faster than doing just one loop?

I think the vectorized update would be faster than the one-loop update based on the same intuition that columnar databases are faster than row-oriented databases for analytical queries. The intuition is that the vectorized implementation can:

  • Save the overhead of per-element branches, function calls, and bounds checks.
    • Although the JIT compiler can do de-virtualization and inline the function calls, the update and updateDouble methods themselves are not inlined according to our observations.
    • Modern CPUs are good at branch prediction, but still, the overhead of per-element branches is not negligible.
  • The high complexity of the loop body may prevent the compiler from doing advanced optimizations for the non-vectorized one-loop update. It is well-known that compilers like simple loops, which makes it easier for them to do optimizations like loop unrolling and auto-vectorization.
  • System.arraycopy has been implemented in a highly optimized way using SIMD instructions.

These are just my intuitions. I will follow your suggestions and do some benchmarking experiments to see if the vectorized update is faster.

I have done an initial benchmarking experiment and the results seem to be promising. I used the following benchmarking code to test the performance of vectorized & non-vectorized updates:

https://github.com/fanyang01/datasketches-java/blob/43c9a93cce29f66f7846ba3db71a12759c01c4ff/src/test/java/org/apache/datasketches/kll/KllDoublesSketchTest.java#L188-L236

The two methods are the same except that one uses the vectorized update method and the other uses the non-vectorized update method. I ran them in IDEA on my MacBook (chip: Apple M1 Pro, memory: 32GB). The results are as follows:

/Library/Java/JavaVirtualMachines/graalvm-jdk-21.0.1+12.1/Contents/Home/bin/java -ea -Xmx4g ...
vectorizedUpdates: 12788 ms
/Library/Java/JavaVirtualMachines/graalvm-jdk-21.0.1+12.1/Contents/Home/bin/java -ea -Xmx4g ...
nonVectorizedUpdates: 15239 ms

The vectorized update method is about 15% faster than the non-vectorized update method in this experiment.

Any suggestions on how to improve the implementation and the benchmarking settings are welcome! I will continue to do more experiments and keep you updated.

This is excellent feedback and gives us a good idea of what you are trying to do. I read the SIGMOD paper on Learned Sorting you referenced. Very interesting! Your idea of using a KLL to create buckets for your improved sorting approach makes lots of sense. It might even make sense to see if we can implement the Learned Sorting algorithm using KLL to generate the CDF for the model.

This is very similar to some work I did recently on an algorithm to create equally-sized partitions from extremely large data sets. For example: suppose you need to partition a dataset of 100B items into partitions of size 3M items. And, of course, for good retrieval properties, the partitions need to be sorted externally and internally. This implies that the result would be about 33K partitions. Attempting to do this with a single quantiles sketch would fail, because requesting 33K evenly spaced quantiles from even the largest KLL sketch requires a precision that is much finer than the base accuracy of the sketch! Well, I have an app for that! :) :) And the prototype code for this is actually published as part of our recent 5.0.X releases.

Unfortunately, this solution needs to be parallelized, and the code I have so far is only written for a single CPU, single threaded, which I used to prove out the algorithm. But parallelization is highly dependent on the specific environment, platform and systems involved. For those folks that are willing to write the parallel solution for their environment, the prototype code is a starting point.

Just to demonstrate what it can do, I did an experiment with the same numbers as above: N = 100B, partition size = 3M. The number of output partitions ended up as 32,768, but with a partition-size RMS error of 0.027%!

Let me know if you are also interested in this.

Meanwhile, I'm going to play with your test code example also.

This Issue is now closed as it has been completed by PR #539.