elasticsearch7-query-filter-aggregation-workshop

references

preface

goals of this workshop
- https://github.com/mtumilowicz/java12-elasticsearch-inverted-index-workshop
- introduction to the basics of API
  - creating index, customize analyzers
  - define mappings
  - indexing documents
  - searching: query, filter
    - analyzing output
  - aggregating
in docker-compose there is elasticsearch + kibana (7.6) prepared for local testing
- cd docker/compose
- docker-compose up -d
workshop and answers are in workshop directory

index

note that 7.0 deprecated APIs that accept types, introduced new typeless APIs

create index

PUT /index-name
{
    "settings": {
        "index" : {
            ... // configure index
        },
        "analysis": {
            ... // customize analyzer
        }
    },
    "mappings": {
        "properties": {
            ... // fields
        }
    }
}

field datatypes
- any field can contain zero or more values by default, however, all values in the array must be of the same datatype
- string
  - text
    - full-text indexed (analyzed)
    - are not used for sorting and seldom used for aggregations
    - example: body of an email or the description of a product
    - sometimes it is useful to have multiple version of the same field: one for full text search and the other for aggregations and sorting
  - keyword
    - are only searchable by their exact value (not analyzed)
    - typically used for filtering, sorting, and aggregations
    - example: IDs, email addresses, hostnames, status codes, zip codes or tags
- numeric
  - byte, short, integer, long ...
- date
```
"date": {
    "type":   "date",
    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
```
  - internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch
  - queries are internally converted on this long representation, and the result is converted back to a string according to the field's date format
- many more: range, object, nested, geo-points...
indexing documents
- example
  - with specified id
```
POST /index-name/_create/id
{
    ... // fields
}
```
  - autogenerated id
```
POST /products/_doc
{
  "name": "Tablet",
  "price": 499
}
```
- each indexed document is versioned
  - prevent conflicting updates
deleting documents
- example
```
DELETE /index-name/_doc/id
```
- optimistic locking - version can be specified
  - example: DELETE /products/_doc/123?version=3
updating documents
- every update is actually a delete + reindex operation.
- example
```
POST /index-name/_update/id
{
    "doc" : {
        "name" : "new_name"
    }
}
```
- partial update (patch): POST
  - for replace (full overwrite): PUT
- optimistic locking - version can be specified
  - example: POST /products/_doc/123?version=3
- by default updates detect if they don’t change anything and return "result": "noop"

search

mechanics

every node in the cluster can handle HTTP and Transport traffic by default
- HTTP = External API access (CRUD/search)
- Transport traffic = internal communication protocol that Elasticsearch nodes use to talk to each other
search requests may involve data held on different data nodes
a search request is executed in two phases
- scatter phase
  - coordinator forwards request to the data nodes which hold the data
  - each data node executes the request locally and returns its results to the coordinating node
- gather phase
  - coordinator reduces each data node’s results into a single global resultset

API

template for search requests

GET index-name/_search // you can search entire cluster: GET /_search {...}
{
    "query": {
        "query-type": { // query type: 'match', `range`, `term`, etc
          // payload specific to query type
        }
    }
}

query
- match
  - text is analyzed before matching
  - standard query for performing a full-text search
  - per field
  - example
```
"match": {
  "description": {
    "query": "lightweight laptop with fast chip"
  }
}
```
- query_string
  - text is analyzed before matching
  - if no field default - search all document
  - used to create a complex search (wildcard characters, searches across multiple fields)
  - example
```
"query_string": {
  "query": "apple AND macbook OR iphone",
  "default_field": "name"
}
```
- match_phrase
  - text is analyzed before matching
  - all the terms must appear in the field
  - terms must have the same order
    - configured slop
      - allows some flexibility in word positions within the phrase
        
        example: gaps or slight reordering
      - this is a brown dog and the dog is brown are OK with slope = 1
  - example
```
"match_phrase": {
  "name": {
    "query": "macbook pro",
    "slop": 1
  }
}
```
- range
  - used for numeric, date, or keyword fields to search within a range
  - example
```
"range": {
  "price": {
    "gte": 500,
    "lte": 1000
  }
}
```
- term
  - text is NOT analyzed before matching
  - exact matching
    - best for keyword fields, IDs, status flags, enums
  - vs filtering: affects _score
  - example
```
"term": {
  "status": "available"
}
```
- bool
  - boolean combinations of other queries
    - all clauses are combined with a top-level logical AND
  - example
```
"bool" : {
    "must" : {
        ...
    },
    "filter": {
        ...
    },
    "must_not" : {
        ...
    },
    "should" : [
        ...
    ],
}
```
  - must
    - must appear in matching documents
    - contributes to the score
  - must_not
    - must not appear in the matching documents
    - scoring is ignored
    - considered for caching
  - filter
    - must appear in matching documents
    - bypass analysis
      - filtering 'Andy' when indexed is 'andy' will give no hit
    - score of the query will be ignored
    - considered for caching
    - Elasticsearch constructs a bitset, which is a binary set of bits denoting whether the document matches this filter
  - should
    - should appear in the matching document
    - contributes to the score

response body

{
    "took" : 1, // in milliseconds
    "timed_out" : false,
    "_shards" : { // count of shards used for the request
        "total" : 1, // total number of shards that require querying
        "successful" : 1, // number of shards that executed the request successfully
        "skipped" : 0,
        "failed" : 0 // number of shards that failed to execute the request
    },
    "hits" : { // documents and metadata
        "total" : { // metadata about the number of returned documents
            "value" : 2, // total number of returned documents
            "relation" : "eq"
        },
        "max_score" : 0.9395274, // highest returned document _score
        "hits" : [ // array of returned document objects
            {
                "_index" : "programming-user-groups",
                "_id" : "2",
                "_score" : 0.9395274,
                "_source" : { ... } // original JSON body
            } 
        ]
    }
}

took
- time elapsed between: receiving request on the coordinator and being ready to send response to client
_shard.skipped
- skipped the request because a lightweight check helped realize that no documents could possibly match on this shard
- typically happens when a search request includes a range filter and the shard only has values that fall outside of that range
hits.total.relation
- tells whether total.value is exact or just an estimate
  - eq: Accurate
  - gte: Lower bound, including returned documents

aggregate

template

GET /programming-user-groups/_search
{
    "aggs": {
        "agg-name" : {
            "agg-type": { ... },
            "aggs":{ // sub aggregations
                "sub-agg-name": {
                    "sub-agg-type": { ... }
                }
            }
        }
    }
}

types

bucketing

group documents into buckets based on some criteria
- example: terms, ranges, dates, etc.
can be nested
- bucketing aggregations can have sub-aggregations (bucketing or metric)
types: terms, range, date_histogram, filters, histogram

example

request

"aggs": {
  "genres": {
    "terms": {
      "field": "genre.keyword"
    }
  }

response

{
    ... // same as search response
    "aggregations" : {
        "genres" : {
            ...
            "buckets" : [ 
                { "key" : "electronic", "doc_count" : 6 },
                { "key" : "rock", "doc_count" : 3 },
                { "key" : "jazz", "doc_count" : 2 }
            ]
        }
    }
}

metrics

calculate values like average, min, max, count
- usually inside buckets
types: avg, sum, min/max, value_count, stats, percentiles

example

request

"aggs": {
  "max_price": {
    "max": {
      "field": "price"
    }
  }
}

response

{
    ... // same as search response
    "aggregations": {
        "max_price": {
            "value": 200.0
        }
    }
}

pipeline

use the output of other aggregations as input
uses the order of buckets returned by a bucket aggregation
types
- derivative
  - formula: current_value - previous_value
  - example: daily change in average product price
- moving_fn
  - formula: avg(values in [N most recent buckets])
  - example: 3-day moving average of pric
- cumulative_sum
  - formula: sum = prev_sum + current_value
  - example: revenue over days
- bucket_script
  - formula: custom_script(params.metric1, params.metric2)
- bucket_selector
  - formula: include bucket if script(params.metric) is true
  - example: keep brands with avg price > 500
- avg_bucket, sum_bucket, max_bucket, min_bucket
  - formula: aggregate([bucket_values])
  - example: Average of daily average prices
- serial_diff
  - formula: value[n] - value[n - lag]
  - example: compare today's avg price to price 7 days ago

example: how average price changed over time

request

GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "timestamp",     // must be a date field
        "calendar_interval": "day"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "price_change": {
          "derivative": { // calculates the difference from the previous bucket
            "buckets_path": "avg_price"
          }
        }
      }
    }
  }
}

response

"aggregations": {
  "by_day": {
    "buckets": [
      {
        "key_as_string": "2025-05-01",
        "avg_price": { "value": 150 },
        "price_change": { "value": null }
      },
      {
        "key_as_string": "2025-05-02",
        "avg_price": { "value": 170 },
        "price_change": { "value": 20 }
      },
      {
        "key_as_string": "2025-05-03",
        "avg_price": { "value": 140 },
        "price_change": { "value": -30 }
      }
    ]
  }
}

drawbacks
- memory footprint
  - terms aggregation builds an in-memory map of buckets
    - bucket state lives in the JVM heap
    - high-cardinality fields => number of unique values (buckets) is huge
    - example
      { "brand:Apple" => 1542, "brand:Dell" => 1100, ... }
- aggregations run per shard
  - each shard computes top N results independently => coordinating node merges
  - problem: global ranking inaccuracy
    - example: product not in any local top 5 are excluded, even if their global total would be higher
- all aggregations are on-the-fly
  - no materialized views or persisted rollups

mtumilowicz / elasticsearch7-query-filter-aggregation-workshop

elasticsearch7-query-filter-aggregation-workshop

preface

index

search

mechanics

API

response body

aggregate

About