peterlamar / elasticsearch-cheatsheet

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ElasticSearch cheatsheet

Elasticsearch cheatsheet and quickstart study guide

Concepts

  1. Documents
  2. Indices
  3. Nodes
  4. Shards
  5. Use Cases

curl

  1. Backup index
  2. List index mapping
  3. Delete index
  4. List indexes
  5. Mapping
  6. Analyzers
  7. NGram
  8. Insert
  9. Bulk-insert
  10. Update
  11. Delete
  12. Get
  13. Cluster Status
  14. List Masters

Search

  1. Match
  2. Fuzzy
  3. Prefix
  4. Wildcard
  5. Match Phrase
  6. Match Phrase Prefix
  7. Filters
  8. Query lite search
  9. Pagination
  10. Sort

Misc

  1. Docker

concepts

Documents

Things you are searching for, can be any text but typically json. Each document has a unique ID, version and type. A document is kind of like a row in a database.

Example:

{
  "name": "Elastic",
  "location": "somewhere",
  "data": [1,2]
}

Example response after posting to ES index:

{
  "_index": "myindex",
  "_type": "_doc",
  "_id": "ndskdf239dkD",
  "_version": 1,
  "result": "created:,
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  }
  "_seq_no": 21,
  "_primary_term" : 1
}

Indices

Also called an inverted index, basically the lookup table in the back of a book

Document 1:

Space: The final frontier. These are the voyages

Document 2:

He's bad, he's number one. He's the space cowboy with the laser gun!

Inverted index

space:    1,2
the:      1,2
final:    1
frontier: 1
he:       2
bad:      2

Indexes can be created for you or you can create them manually.

Example index create

PUT /inspections
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}

Each index has a "number_of_shards" value to pertains to that index. Five primary shards and one replica created by default for each index.

Nodes

Nodes are servers added to a cluster to increase capacity

Shards

Shards are self-contained Lucene indexes. Documents in an index can be distributed across multiple shards (10 documents per shard for example). Shards can be distributed across multiple nodes. As cluster grows or shrinks Elasticsearch migrates shards to rebalance cluster

There are two types of shards, primaries and replicas and each document belongs to a primary shard. Only the primary shard can accept indexing requests but both can accept query requests.

The number of primary shards in an index is fixed at index creation time but replicas can be changed at any time. Number of shards can be considered similar to disk partition.

Shards are allocated based on dataset growth expectations.

Each shard:

  1. Consumes file handles, memory and CPU resources
  2. Each search request touches a copy of every shard
  3. Problems can happen when shards compete for the same hardware resource
  4. More shards has lower document relevance

Performance considerations:

  1. Queries are sent to each shard simultaneuously and then the results are aggregated. More I/O headroom and multicore processor can benefit from sharding.
  2. More shards involves more maintenence overhead
  3. Larger shards mean longer cluster rebalance times
  4. Querying small shards makes processing per shard faster
  5. More queries involves more overhead, so a smaller number of large shards maybe faster.

Advice:

  1. Ideal scenario is one shard per index per node
  2. Starting point for cluster planning, allocate shards with a factor of 1.5 to 3 times the number of nodes in the initial configuration. So if starting with 3 nodes then max 9 shards
  3. Recommended shard size 1GB < x < 40GB, with common sizes 20GB < x < 40GB. Divide expected data size by number of shards to reach reasonable number. For example, 200GB of data then have 7 shards at approx 30GB each
  4. Number of shards per GB of heap space should be less than 20
  5. Max JVM Heap Size recommendation for Elasticsearch = 30-32GB

Performance Experiments

  1. Elastic cluster sizing

Use-Cases

  1. Logstash - Accumilating daily indices, incurring small search loads

If left with the default of 5 primary shards for every index (double if to include the default replica), then after six months there could be 5 x 30 x 6 = 890 shards which would require > 15 nodes (Roughly 60 primary shards per node or approx 15 shards for each GB of heap space assumed to be 4 for primary shards and 4 for the replicas)

A custom setting of 1 shard per node with a single replica will be 180 shards in six months which is more managable

curl

Backup-index

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/_reindex -d '{
  "source": {
    "index": "samples"
  },
  "dest": {
    "index": "samples_backup"
  }
}'

List-Index-Mapping

List the fields and their types in an index

curl -X GET http://localhost:9200/samples

Delete-Index

curl -X DELETE 'http://localhost:9200/samples'

List Indexes

curl -X GET 'http://localhost:9200/_cat/indices?v'

List docs in index

curl -X GET 'http://localhost:9200/sample/_search'

Mapping

Mappings are schema definitions that customize defaults.

./curl -XPUT 127.0.0.1:9200/movies -d '
{
    "mappings":{
        "properties":{
            "year":{"type":"date"}
        }
    }
}'

Field Types: String, Byte, Short, Integer, Long, Float, Double, Boolean, Date

./curl -XPUT 127.0.0.1:9200/movies -d '
{
    "mappings":{
        "properties":{
            "user_id":{"type":"long"}
        }
    }
}'

Index field for full-text search?

./curl -XPUT 127.0.0.1:9200/movies -d '
{
    "mappings":{
        "properties":{
            "genre":{"index":"not_analyzed"}
        }
    }
}'

Make genre type keyword so its not analyzed (matches other cases, etc)

./curl -XPUT 127.0.0.1:9200/movies -d '
{
    "mappings":{
        "properties": {
            "id": {"type":"date"},
            "year":{"type":"date"},
            "genre":{"type":"keyword"},
            "title":{"type":"text","analyzer":"english"}
        }
    }
}
'

Map film to franchise to make a parent

./curl -XPUT 127.0.0.1:9200/series -d '
{"mappings":{
    "properties":{
        "film_to_franchise":{
            "type":"join","relations":{
                "franchise":"film"
                }}}}
    }'

Analyzers

  1. Use standard analyzer if none is specified
  2. Character Filters: Remove HTML encoding, convert & to and
  3. Tokenizer: Split strings on whitespace/punctiation/non-letters
  4. Token Filter: Lowercasing, stemming, synonyms, stopwords
  5. Standard: split on word boundaries, remove punctuation, lowercases, good if language is unknown
  6. Simple: Split on anything that isn't a letter, and lowercase
  7. Whitespace: Splits on whitespace but doesn't lowercase
  8. Language: Accounts for language-specific stopwords and stemming

Test analyzer

./curl -XGET 127.0.0.1:9200/movies/_analyze\?pretty -d '{
    "analyzer": "autocomplete",
    "text": "Sta"
}'

Specify analyzer

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{
    "query":{
        "match":{
            "title":{
                "query":"sta",
                "analyzer":"standard"
            }
        }
    }
}'

NGram

Custom analyzer 'autocomplete'

./curl -XPUT 127.0.0.1:9200/movies -d '{
    "settings":{
        "analysis":{
            "filter":{
                "autocomplete_filter": {
                    "type":"edge_ngram",
                    "min_gram":1,
                    "max_gram":20
                }
            },
        "analyzer":{
            "autocomplete":{
                "type":"custom", 
                "tokenizer":"standard",
                "filter": [
                    "lowercase",
                    "autocomplete_filter"
                ]
                }
            }
        }
    }
}'

Assign analyzer as mapping

./curl -XPUT '127.0.0.1:9200/movies/_mapping?pretty' -d '
{
    "properties": {
        "title":{
            "type":"text",
            "analyzer":"autocomplete"
        }
    }
}
'

insert

Insert record

./curl -XPUT 127.0.0.1:9200/movies/_doc/109487 -d '
{
    "genre":
    ["IMAX","Sci-Fi"],
    "title":"Interstellar",
    "year":2014
}'
curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/1 -d '{
   "school" : "Harvard"			
}'

bulk-insert

./curl -XPUT 127.0.0.1:9200/_bulk\?pretty --data-binary @movies.json

update

Each document has a _version firle and is immutable When you update an existing document, a new document is created with an incremented _version and then the old document is marked for deletion

./curl -XPOST 127.0.0.1:9200/movies/_doc/109487/_update -d '
{
    "doc" :{
    "title":"Interstellarxx"
    }
}
'

Insert and Update

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/2 -d '
{
    "school": "Clemson"
}'
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/2/_update -d '{
"doc" : {
               "students": 50000}
}'

delete

curl -XDELETE 127.0.0.1:9200/movies/_doc/58559

Get

Get movie with ID 109487

./curl  -XGET 127.0.0.1:9200/movies/_doc/109487\?pretty

Cluster Status

./curl  -XGET 192.168.86.23:9200/_cluster/stats\?pretty

List Masters

./curl  -XGET "192.168.86.23:9200/_cat/master?v=true&pretty"

search

Queries are wrapped in a "query": { } block

Query Types

  • Match all: Returns all documents and is default. Formally used with a filter
{"match_all":{}}
  • Match: Searches analyzed results, such as full text search
{"match":{"title":"star"}}
  • Multi-match: Run the same query on multiple fields
{"multi_match":{"query":"star","fields":["title","synopsis"]}}
  • Bool: Works like a bool filterm but results are scored by relevance
./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '
{
    "query":{
        "bool": {
            "must":{"term":{"title":"trek"}},
            "filter":{"range":{"year":{"gte":2010}}}
        }
    }
}'

Get all movies

./curl  -XGET 127.0.0.1:9200/movies/_search\?pretty
./curl  -XGET 127.0.0.1:9200/movies/_search\?q=dark

Match

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"match":{"title":"star"}}}'

Fuzzy

Fuzzy defaults

  • 0 for 1-2 character strings
  • 1 for 3-5 character strings
  • 2 for anything else

Allow 1 character off

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"fuzzy":{"title":{"value":"intersteller","fuzziness":1}}}}'

Prefix

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"prefix":{"year":"201"}}}'

Wildcard

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"wildcard":{"year":"1*"}}}'

Match Phrase

Match Phrase (order and lettering)

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"match_phrase":{"title":"star wars"}}}'

Match Phrase

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"match_phrase":{"genre":"sci"}}}'

Match Phrase with Slop which allows term to move in either direction

  • enables star beyond to match Star Trek Beyond (also beyond star)
  • enables "quick brown fox" to match "quick fox" with a slop of 1
  • If slop of 100 is specified, then any document with 'star' or 'beyond' within 100 words could be returned, but closer values are returned with higher relevance
./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"match_phrase":{"title":{"query":"star beyond", "slop": 1}}}}'

Find films whose parent matches "Star Wars" ./curl -XGET 127.0.0.1:9200/series/_search?pretty -d ' {"query":{ "has_parent":{ "parent_type":"franchise", "query":{ "match":{ "title":"Star Wars"} } } } }'

Find franchise associated with a film

./curl  -XGET 127.0.0.1:9200/series/_search\?pretty -d '
{"query":{
    "has_child":{
        "type":"film",
        "query":{
            "match":{
                "title":"The Force Awakens"}
        }
    }
}
}
'

Match Phrase Prefix

Can be used to implement autocomplete

./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"query":{"match_phrase_prefix":{"title":{"query":"star"}}}}'

Filters

Filters are wrapped in a "filter": { } block

Types of Filters

  • Term: Filter by exact values
{"term":{"year":2014}}
  • Terms: Match if any exact values in a list match
{"terms":{"genre":["Sci-Fi","Adventure"]}}
  • Range: Find numbers or dates in a given range (gt, gte, lt, lte)
{"range":{"year":{"gte": 2010}}}
  • Exists: Find documents where a field exists
{"exists":{"fields":"tags"}}
  • Missing: Find documents where a field is missing
{"missing":{"field":"tags"}}
  • Bool: Combine filters with Boolean logic (must, must_not, should)

Query Lite

Compared to json queries, can be:

  1. Cryptic
  2. Security vulnerabile
  3. Fragile

Get movie with title star

./curl  -XGET '127.0.0.1:9200/movies/_search?q=title:star&pretty=true'

Request body equivalent

./curl  -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{
    "query":{
        "match":{
            "title":"star"
        }
    }
}'

Released after 2010 with Trek in the title

./curl  -XGET '127.0.0.1:9200/movies/_search?q=+year>2010+title:trek&pretty=true'

Request body equivalent

./curl  -XGET 127.0.0.1:9200/movies/_search\?pretty -d '
{
    "query":{
        "bool":{
            "must":{"term": {"title":"trek"}},
            "filter":{"range":{"year":{"gte":2010}}}
        }
    }
}'

pagination

  • Pagination results are still retrieved, but sorted and o mitted before returning to user
./curl -XGET '127.0.0.1:9200/movies/_search?size=2&from=2&pretty'
./curl -XGET 127.0.0.1:9200/movies/_search\?pretty -d '{"from": 2, "size": 2, "query":{"match":{"genre":"Sci-Fi"}}}'

When from is omitted it starts from 0

./curl -XGET '127.0.0.1:9200/movies/_search?size=2&pretty'

sort

  • A string field that is analyzed for full text search cannot be used to sort documents since it exists in the inverted index as individual terms
./curl -XGET '127.0.0.1:9200/movies/_search?sort=year&pretty'

A copy of a field could be made so allow full text search and raw sorting

./curl -XPUT 127.0.0.1:9200/movies/ -d '{
    "mappings": {
        "properties" : {
            "title": {
                "type":"text",
                "fields":{
                    "raw":{
                        "type":"keyword"
                    }
                }
            }
        }
    }
}'
./curl -XGET '127.0.0.1:9200/movies/_search?sort=title.raw&pretty'
  • Cannot change mapping on an existing index. Would have to delete it, setup mapping and reindex

misc

Docker

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.1

Reference

Content from OReilly course

About


Languages

Language:Python 73.9%Language:Shell 26.1%