This plugin extends Elasticsearch with a fast, memory-efficient way to estimate the cardinality (number of unique terms) of a field. The field can be a string, a number, or a boolean. The plugin registers a new aggregation type (cardinality) and a REST action (_cardinality).
We love pull-requests!
- Elasticsearch 1.0.0+
- Compiled versions of the plugin are stored in the dist directory.
This plugin uses the HyperLogLogPlus algorithm provided by the Stream-lib library to estimate the cardinality (unique term count) of a field. In short, it estimates the number of unique values of a field without loading all of them into RAM. Merging estimates across shards and across indices is supported and efficient.
Without such a plugin, the only way to count the unique values of a field was to retrieve all values on the client side and count the length of the resulting array, which is highly inefficient.
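To illustrate why the estimate needs little memory and why merging across shards is cheap, here is a minimal Python sketch of the basic HyperLogLog idea. It is a simplified illustration, not Stream-lib's HyperLogLogPlus implementation and not the plugin's code; the class name, the SHA-1 based hash, and the default precision `p=12` are all assumptions chosen for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch (illustrative; not HyperLogLogPlus)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m   # fixed memory, independent of input size

    def add(self, value):
        # Derive a 64-bit hash from the value's string form (assumed hash choice).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bits = 64 - self.p
        idx = h >> bits                     # top p bits select the register
        rest = h & ((1 << bits) - 1)        # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = bits - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging two sketches is a register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)    # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)
```

Each shard can build its own sketch and ship only the registers; the coordinating node merges them with a register-wise maximum, which yields exactly the registers a single sketch would have produced over the union of the values.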
To estimate the cardinality of a field, use the following REST action:
curl -XGET http://localhost:9200/{index}/{field}/_cardinality
For example, to estimate the number of unique IPs in the index logstash-2014.02.03:
curl -XGET http://localhost:9200/logstash-2014.02.03/ip/_cardinality
{
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"count": 46367
}
To estimate the number of unique IPs across several indices, use a wildcard in the index name:
curl -XGET 'http://localhost:9200/logstash-2014.01.*/ip/_cardinality'
{
"_shards": {
"total": 86,
"successful": 86,
"failed": 0
},
"count": 919979
}
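The REST action can also be called from a script. The following is a minimal Python sketch using only the standard library; the function names and the default host are assumptions for illustration, and treating any failed shard as an error is a design choice of this example rather than behaviour of the plugin.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def cardinality_url(index, field, host="http://localhost:9200"):
    # Keep '*' and ',' unescaped so wildcard and multi-index patterns still work.
    return "{}/{}/{}/_cardinality".format(
        host, quote(index, safe="*,"), quote(field, safe="")
    )


def parse_cardinality_response(text):
    # Expects the response shape documented above: {"_shards": {...}, "count": N}.
    body = json.loads(text)
    failed = body["_shards"]["failed"]
    if failed:
        # Assumption of this sketch: a partial result is reported as an error.
        raise RuntimeError("{} shard(s) failed".format(failed))
    return body["count"]


def estimate_cardinality(index, field, host="http://localhost:9200"):
    # Performs the GET request; requires a running node with the plugin installed.
    with urlopen(cardinality_url(index, field, host)) as resp:
        return parse_cardinality_response(resp.read().decode("utf-8"))
```

For instance, `estimate_cardinality("logstash-2014.01.*", "ip")` would issue the same request as the curl command above and return the value of the `count` field.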
To build an aggregation estimating the cardinality of a field, use the following code:
{
"aggregations": {
"<aggregation_name>": {
"cardinality": {
"field": "<field_name>"
}
}
}
}
For example, to estimate the number of unique IPs in a result set, use the following code:
{
"aggregations": {
"uniq_ips": {
"cardinality": {
"field": "ip"
}
}
}
}
The response contains the estimate under the aggregation name:
{
"aggregations": {
"uniq_ips": {
"value": 42
}
}
}
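When requests are generated programmatically, the aggregation body and the lookup of its result can be wrapped in two small helpers. This is a sketch under the assumption that the body is sent as part of a regular search request; the helper names are hypothetical and not part of the plugin.

```python
import json


def cardinality_aggregation(name, field):
    # Builds the request body shown above for a single cardinality aggregation.
    return {"aggregations": {name: {"cardinality": {"field": field}}}}


def read_cardinality(response, name):
    # Extracts the estimated value from a response shaped like the example above.
    return response["aggregations"][name]["value"]


if __name__ == "__main__":
    # Print the JSON body for the "uniq_ips" example.
    print(json.dumps(cardinality_aggregation("uniq_ips", "ip"), indent=2))
```

Keeping the aggregation name in one place avoids a mismatch between the name used in the request and the key read from the response.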
To install the plugin:
./plugin --url elasticsearch-cardinality-plugin-0.0.1.zip --install index-cardinality
To remove the plugin:
./plugin --remove index-cardinality