davidmr001 / elasticsearch-topk-plugin

Elasticsearch Top-K Aggregation Plugin

Home Page: https://github.com/elasticsearch/elasticsearch/issues/6697

Disclaimer: Although we don't use Elasticsearch for Algolia's hosted full-text, numerical, and faceted search engine, we do use it for internal analytics (faceting over billions of log lines generated by our engine, with no full-text search).

Elasticsearch Top-K Plugin

This plugin extends Elasticsearch with a fast, memory-efficient aggregation that statistically approximates the top-k most frequent values of a field. The field can be a string, numeric, or boolean field. The plugin registers a new aggregation type (topk).

This plugin is a temporary replacement for #6697.

We love pull-requests!

Prerequisites:

  • Elasticsearch 1.3.0+

Binaries

  • Compiled versions of the plugin are stored in the dist directory.

Why

The default terms aggregation implementations use an amount of memory that grows linearly with the cardinality of the value source they run on. Things get even worse with sub-aggregations, especially memory-intensive ones such as percentiles, cardinality, top_hits, or bucket aggregations. This plugin is based on the Space-Saving algorithm, which tries to detect the most frequent terms using a fixed (configurable) number of counters.
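To make the fixed-counter idea concrete, here is a minimal sketch of Space-Saving in plain Java. It is not the plugin's code: the class name `SpaceSavingSketch` and its methods are hypothetical, and it uses a linear scan to find the smallest counter where a real implementation would use a more efficient structure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative Space-Saving sketch (hypothetical class, not part of the plugin). */
class SpaceSavingSketch {
    private final int capacity;                       // fixed number of counters
    private final Map<String, Long> counts = new HashMap<>();

    SpaceSavingSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records one occurrence of a term. */
    void offer(String term) {
        Long current = counts.get(term);
        if (current != null) {
            counts.put(term, current + 1);            // already monitored: increment
        } else if (counts.size() < capacity) {
            counts.put(term, 1L);                     // free counter available
        } else {
            // All counters are in use: evict the term with the smallest count.
            String victim = null;
            long min = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() < min) {
                    min = e.getValue();
                    victim = e.getKey();
                }
            }
            counts.remove(victim);
            // The newcomer inherits min + 1, so its count may be overestimated.
            counts.put(term, min + 1);
        }
    }

    /** Estimated count for a term, or 0 if it is not currently monitored. */
    long estimate(String term) {
        return counts.getOrDefault(term, 0L);
    }

    /** Returns up to k monitored terms, ordered by decreasing estimated count. */
    List<String> topK(int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Memory stays bounded by `capacity` regardless of how many distinct terms are seen. The price is that a term entering through eviction starts from the evicted count, so rare terms can be overestimated.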

Principle

This plugin uses the StreamSummary data structure provided by the Stream-lib library to compute the top-k values of a field. In short, it retrieves the most frequent terms of a field without loading all of them (and their associated sub-aggregations) into RAM. Merging results across shards and across indices is supported but may reduce accuracy: this is the general trade-off of the algorithm.
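The accuracy trade-off of merging can be illustrated with a deliberately simplified sketch. It assumes each shard reports only the counters it retained and that merging sums them term by term. This is not how StreamSummary merges internally, and `ShardMergeSketch` is a hypothetical name used only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified illustration of merging per-shard counters (hypothetical helper). */
class ShardMergeSketch {
    /**
     * Sums the retained counters of two shards term by term.
     * A term that a shard did not retain contributes nothing from that shard,
     * which is where the merged counts can fall below the true counts.
     */
    static Map<String, Long> merge(Map<String, Long> left, Map<String, Long> right) {
        Map<String, Long> merged = new HashMap<>(left);
        for (Map.Entry<String, Long> e : right.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return merged;
    }
}
```

For example, suppose shard A retained `{x: 50, y: 40}` and did not retain `z`, which occurred 30 times there, while shard B retained `{z: 45, x: 10}`. The merged result is `{x: 60, z: 45, y: 40}`, so `x` ranks first even though the true count of `z` is 75.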

Usage

To build an aggregation that keeps the top-k elements of a field, use the following request body:

{
  "aggregations": {
    "<aggregation_name>": {
      "topk": {
        "field": "<field_name>",
        "size": 10
      }
    }
  }
}

For example, to keep the 100 most frequent values of your "ip" field, use:

{
  "aggregations": {
    "top_ips": {
      "topk": {
        "field": "ip",
        "size": 100
      }
    }
  }
}

The response then contains one bucket per retained value:

{
  "aggregations": {
    "top_ips": {
      "buckets": [
        { "key": "1.2.3.4", "doc_count": 62718 },
        { "key": "5.6.7.8", "doc_count": 54233 },
        [...]
        { "key": "1.6.3.8", "doc_count": 12123 }
      ]
    }
  }
}
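
If you assemble requests from Java, the request body shown above can be produced by a small helper. This is only a sketch: `TopkRequestBuilder` is a hypothetical class, not part of the plugin, and it escapes only quotes and backslashes, so a JSON library is the safer choice for arbitrary input.

```java
/** Builds the JSON body of a topk aggregation request (hypothetical helper). */
class TopkRequestBuilder {
    /**
     * @param aggregationName name under which the buckets are returned
     * @param field           field whose most frequent values are requested
     * @param size            number of values to keep (the "size" parameter)
     */
    static String build(String aggregationName, String field, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return "{\"aggregations\":{\"" + escape(aggregationName)
                + "\":{\"topk\":{\"field\":\"" + escape(field)
                + "\",\"size\":" + size + "}}}}";
    }

    /** Minimal escaping: backslashes and double quotes only. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Calling `TopkRequestBuilder.build("top_ips", "ip", 100)` yields a compact form of the "ip" example request.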

Setup

Installation

./plugin --url file:///absolute/path/to/elasticsearch-topk-plugin-LATEST.zip --install topk-aggregation

Uninstallation

./plugin --remove topk-aggregation

License: Apache License 2.0

