elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine

Home Page: https://www.elastic.co/products/elasticsearch


Aggregations: Memory-bound terms

jpountz opened this issue · comments

Our terms aggregation implementations use an amount of memory that is linear in the cardinality of the value source they run on. Thanks to global ordinals, we only require 4 bytes per unique term to track the count, so even with a field that has a cardinality of 10M, that would only be 40MB of memory.

However, things get worse when using sub aggregations, especially memory-intensive ones such as percentiles, cardinality, top_hits or bucket aggregations. Ideally, we would want memory usage to depend on the requested size instead of the cardinality of the value source.

I have been looking recently at the Space-Saving algorithm described in section 3.1 of Efficient Computation of Frequent and Top-k Elements in Data Streams. Although it is described for computing top buckets based on counts, I think it would be possible to use it when sorting by term or sub-aggregation. But the bounded memory usage does not come for free, so it would have two main drawbacks compared to the current implementation:

  1. it would only work correctly on skewed distributions
  2. there is no clear separation between buckets and some buckets might aggregate documents from other buckets

I think 1 is fine since top terms tend to make more sense on skewed distributions anyway, e.g. the most frequent terms in natural-language text, or the top IP addresses that hit a website.

2 is more inconvenient, especially if there are other aggregations under this terms aggregation. One piece of good news is that we could know which buckets might have aggregated data that is not theirs (they would be those whose term ordinal has been updated during collection), so we could include that in the response if necessary. On the other hand, if there are no sub aggregations, it would be more acceptable to use this implementation since counts are inaccurate anyway. And it would also work nicely with #6696 since the maximum error can be estimated.
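For reference, here is a minimal, illustrative sketch of the Space-Saving algorithm from the paper mentioned above: a fixed set of m counters, where an untracked term evicts the counter with the minimum count and inherits that count. This inheritance is exactly why a bucket can end up aggregating documents that belong to other terms. The class and member names are mine for illustration; this is not Elasticsearch code.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the Space-Saving algorithm (Metwally et al., section 3.1).
// Illustrative only; names and structure are not taken from Elasticsearch.
class SpaceSaving {

    static final class Counter {
        long count; // estimated frequency of the tracked term
        long error; // maximum overestimation inherited from the evicted counter
    }

    private final int m;                                   // fixed number of counters (the memory bound)
    private final Map<String, Counter> counters = new HashMap<>();

    SpaceSaving(int m) {
        this.m = m;
    }

    void offer(String term) {
        Counter c = counters.get(term);
        if (c != null) {                                   // already tracked: just increment
            c.count++;
        } else if (counters.size() < m) {                  // free slot: start tracking
            c = new Counter();
            c.count = 1;
            counters.put(term, c);
        } else {                                           // evict the counter with the minimum count
            String minTerm = null;
            Counter min = null;
            for (Map.Entry<String, Counter> e : counters.entrySet()) {
                if (min == null || e.getValue().count < min.count) {
                    minTerm = e.getKey();
                    min = e.getValue();
                }
            }
            counters.remove(minTerm);
            Counter replacement = new Counter();
            replacement.count = min.count + 1;             // the new term inherits the evicted count, plus one
            replacement.error = min.count;                 // so its count may overestimate by up to min.count
            counters.put(term, replacement);
        }
    }

    Map<String, Counter> counters() {
        return counters;
    }
}
```

The eviction rule is what keeps memory constant: whatever the cardinality of the value source, at most m terms are tracked at any time, and on skewed distributions the heavy hitters quickly accumulate counts large enough that they are never evicted.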

Hey there,

with such an algorithm, the top buckets are determined once all values have been processed, right? Therefore, the sub (bucket-specific & nested) aggregations can only be computed once all documents have been collected, in order to know the final list of buckets, isn't it? Because if I'm right, for each collected document (in the collect method) we need to call collectableSubAggregators.collect with the associated bucketOrd (which will be unknown at this step). Am I missing something?

Instead, in the case of count-based (or term-based) sorting, is there any chance the sub aggregations could be collected in another pass?

Therefore, the sub (bucket-specific & nested) aggregations can only be computed once all documents have been collected

I think sub aggregations could be collected on the fly. The algorithm works on a fixed set of m counters, so we could associate a bucket ordinal (between 0 and m-1) with each of these counters and use them to collect the sub aggregations.

It might introduce accuracy issues for the sub aggregations since counters can be relabeled during collection, but this is the general trade-off of this algorithm: trading accuracy for memory, so I think it's fine? Moreover, if the distribution of the data is skewed enough and if we oversize m a bit compared to the number of top terms that we are interested in, the likelihood of the top terms being affected by this issue should remain low.
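To make the on-the-fly idea concrete, here is a hypothetical sketch in which each of the m counters owns a slot index that doubles as the bucket ordinal, so sub aggregations can be collected immediately. A relabeled slot keeps whatever was already collected for the evicted term, which is the accuracy issue mentioned above. The SubAggregator interface and every other name here is an assumption made for illustration, not the actual Elasticsearch aggregator framework.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of collecting sub aggregations on the fly with m reusable slots.
// SubAggregator stands in for the real collector API; none of this is Elasticsearch code.
class TopKBucketCollector {

    interface SubAggregator {
        void collect(int doc, int bucketOrd);              // collect a document into a bucket
    }

    private final int m;                                   // fixed number of counters / buckets
    private final Map<String, Integer> slotByTerm = new HashMap<>();
    private final String[] termBySlot;                     // slot in [0, m) -> currently tracked term
    private final long[] counts;
    private final SubAggregator sub;

    TopKBucketCollector(int m, SubAggregator sub) {
        this.m = m;
        this.termBySlot = new String[m];
        this.counts = new long[m];
        this.sub = sub;
    }

    void collect(int doc, String term) {
        Integer slot = slotByTerm.get(term);
        if (slot == null) {
            slot = acquireSlot(term);                      // Space-Saving insert or relabel
        }
        counts[slot]++;
        // The slot index doubles as the bucket ordinal, so the sub aggregation is
        // collected immediately. After a relabel the slot may still carry data that
        // was collected for the evicted term: the accuracy trade-off discussed above.
        sub.collect(doc, slot);
    }

    private int acquireSlot(String term) {
        if (slotByTerm.size() < m) {                       // a free slot is still available
            int slot = slotByTerm.size();
            slotByTerm.put(term, slot);
            termBySlot[slot] = term;
            return slot;
        }
        int minSlot = 0;                                   // otherwise relabel the slot with the minimum count
        for (int s = 1; s < m; s++) {
            if (counts[s] < counts[minSlot]) {
                minSlot = s;
            }
        }
        slotByTerm.remove(termBySlot[minSlot]);
        slotByTerm.put(term, minSlot);
        termBySlot[minSlot] = term;                        // the count is kept, so the increment in collect()
        return minSlot;                                     // brings it to min + 1, as in Space-Saving
    }
}
```

Oversizing m relative to the requested number of top terms reduces how often the slots that end up in the final response have been relabeled.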

Makes sense, we're on the same page! Thank you @jpountz

Just wanted to let you know that I've released a plugin embedding this algorithm, targeting ES 1.2+. Our first tests are pretty conclusive 👍

It's still an alpha version, but I would love any comments/reviews/pull requests from ES gurus reading this comment :)

Plugin is here: elasticsearch-topk-plugin

It looks good in general. FYI, you would get better performance by specializing on numerics/strings like the terms aggregation does. On numerics this would only help a bit, by saving the conversion to BytesRef and allowing the use of more efficient data structures, but on strings it could have a big impact. The way field data works on strings is that there is a first level of indirection that gives ordinals given a doc ID, and a second level of indirection that gives the values given an ordinal. Ordinals have the interesting property of being dense (so you can use them as indices into arrays) and sorted (useful for comparisons, which would help if you want to sort buckets of this top-k agg by term). Additionally, they are fast to retrieve. On the other hand, retrieving values given an ordinal might be slow, especially if you are using doc values. That's why our aggregations try to use ordinals whenever possible, as this brings important speedups.
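As a rough illustration of the ordinal-based approach described above, counting can be done entirely on dense ordinals, and the potentially slow value lookup can be deferred to the few buckets that are actually returned. This sketch assumes a Lucene 8-style SortedSetDocValues API and is not the terms aggregation code itself.

```java
import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// Sketch of counting per ordinal instead of per value; illustrative only.
class OrdinalCounting {

    // Ordinals are dense, so a plain array indexed by ordinal is enough to hold the counts.
    static long[] countPerOrdinal(LeafReader reader, String field) throws IOException {
        SortedSetDocValues values = DocValues.getSortedSet(reader, field);
        long[] counts = new long[(int) values.getValueCount()];
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (values.advanceExact(doc)) {
                for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
                    counts[(int) ord]++;                   // no value lookup while collecting
                }
            }
        }
        return counts;
    }

    // Resolving an ordinal back to its term is only needed for the top buckets at the end.
    static BytesRef resolveTerm(SortedSetDocValues values, long ord) throws IOException {
        return values.lookupOrd(ord);                      // potentially slow, so do it once per returned bucket
    }
}
```

Since ordinals are sorted, sorting the final buckets by term also amounts to comparing ordinals rather than BytesRef values.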

Thanks for the input @jpountz, I'm definitely planning to use the ordinals but wanted to get a working version out first.

@jpountz is this still something you plan on pursuing?

Yes. I think it can be useful for pagination in particular, when sorting by term. I'll close for now and reopen when I have more concrete plans.