waleedsamy / elastic-phonatic-comparsion

Comparison between Elasticsearch phonetic algorithms

elastic-phonatic-comparsion

Comparison between Elasticsearch phonetic algorithms

TL;DR

click me to see the comparison.

Start Elasticsearch 5 and Kibana:

  docker-compose up -d

Elasticsearch plugins needed: run docker run -it elasticsearch bash, then:

  # provides the icu_normalizer char_filter and the icu_folding filter
  bin/elasticsearch-plugin install analysis-icu
  # provides the phonetic algorithms
  bin/elasticsearch-plugin install analysis-phonetic

  # restart elasticsearch afterwards: docker-compose restart

Test:

  • Load the Kibana console with test data by clicking here

Notes:

  • If you are using an Elasticsearch version older than 5, check the changes made to the suggester.
  • The autocomplete field no longer needs the payload attribute; _source is used as a replacement.
  • The autocomplete field does not need the output attribute; you can use _source to simulate it.
  • Consider using index/_search instead of the deprecated index/_suggest.
  • Fuzzy Matching to Search by Sound
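
The notes on _source and index/_search can be sketched as a request body. This is a minimal sketch, not the repository's own query: the index name cities, the completion field suggest, the source field name and the suggestion label city-suggest are all hypothetical. It only builds and prints the JSON, which could then be pasted into the Kibana console.

```python
import json


def suggest_request(prefix, field="suggest", label="city-suggest"):
    """Build a completion-suggester body for the index/_search endpoint.

    The field and label defaults are hypothetical names for illustration.
    """
    return {
        # _source filtering returns stored document fields, which is what
        # replaces the old payload/output attributes.
        "_source": ["name"],  # "name" is an assumed document field
        "suggest": {
            label: {
                "prefix": prefix,
                "completion": {"field": field},
            }
        },
    }


body = suggest_request("kai")
print("POST cities/_search")  # hypothetical index name
print(json.dumps(body, indent=2))
```

Each returned option then carries the selected _source fields of the matching document.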

Issues:

  • Suggester contexts are combined with OR, not AND (issue 21291)

How to build custom analyzers

  • char_filter: Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.

  • tokenizer: A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.

  • filter: Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. removing stopwords) or add tokens (e.g. synonyms).

       {
          "analyzer":{
             "my_custom_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "char_filter":[
                   "html_strip"
                ],
                "filter":[
                   "lowercase",
                   "asciifolding"
                ]
             }
          }
       }
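
The same structure can carry a phonetic token filter from the analysis-phonetic plugin. The sketch below generates one custom analyzer per compared encoder; the filter and analyzer names are hypothetical, and the chain (standard tokenizer, lowercase, icu_folding, phonetic) is an assumption rather than the exact setup used for the comparison.

```python
import json

# Encoder names as accepted by the phonetic token filter's "encoder" option.
ENCODERS = ("cologne", "soundex", "metaphone", "double_metaphone")


def phonetic_settings(encoders=ENCODERS):
    """Build index settings with one phonetic filter and analyzer per encoder."""
    filters = {}
    analyzers = {}
    for enc in encoders:
        filter_name = f"{enc}_filter"  # hypothetical name
        filters[filter_name] = {
            "type": "phonetic",
            "encoder": enc,
            # keep the original token next to its phonetic code
            "replace": False,
        }
        analyzers[f"{enc}_analyzer"] = {  # hypothetical name
            "type": "custom",
            "tokenizer": "standard",
            # icu_folding comes from the analysis-icu plugin
            "filter": ["lowercase", "icu_folding", filter_name],
        }
    return {"settings": {"analysis": {"filter": filters, "analyzer": analyzers}}}


settings = phonetic_settings()
print(json.dumps(settings, indent=2))
```

The printed body could be sent when creating an index, so that each analyzer can be tried against the same city names.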
    

results

  • ✓ means match
  • ☓ means doesn't match
cities/algorithms cologne soundex metaphone doublemetaphone
Kairo vs Cairo
Kairo vs Qairo
Koln vs Köln
Paris vs Biarritz
Berlin vs Paralimni
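
To make this kind of comparison concrete, here is a minimal sketch of classic American Soundex in plain Python. It is not the plugin's implementation, only an illustration of how one of the compared algorithms encodes a few of the pairs above; it handles ASCII letters only, so Köln would need folding first.

```python
# Letter groups of classic Soundex; vowels, H, W and Y carry no digit.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of an ASCII word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]  # the first letter is kept as-is
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate equal digits
            prev = digit
    return (code + "000")[:4]  # pad or truncate to 4 characters


for a, b in (("Kairo", "Cairo"), ("Paris", "Biarritz"), ("Berlin", "Paralimni")):
    print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Because Soundex keeps the first letter, Kairo (K600) and Cairo (C600) get different codes even though they sound alike, while Berlin (B645) and Paralimni (P645) differ only in that first letter.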

suggestion

  • The Metaphone algorithm produced the results closest to what I expected in my test.
  • Don't depend only on the phonetic algorithm; give every suggestion a weight as well.
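
The weight advice can be sketched as follows. The documents, input variants and weight values are made up for illustration; the only assumption about Elasticsearch is that a completion field accepts an input list together with a positive integer weight.

```python
import json


def suggestion_doc(name, inputs, weight):
    """Build a document whose completion field carries a ranking weight."""
    if weight <= 0:
        raise ValueError("weight must be a positive integer")
    return {
        "name": name,  # hypothetical field returned via _source
        "suggest": {   # hypothetical completion field
            "input": list(inputs),
            "weight": int(weight),
        },
    }


# Hypothetical weights, e.g. derived from how often a city is searched.
docs = [
    suggestion_doc("Cairo", ["Cairo", "Kairo"], 20),
    suggestion_doc("Köln", ["Köln", "Koln", "Cologne"], 10),
]
for doc in docs:
    print(json.dumps(doc, ensure_ascii=False))
```

When two suggestions match the same phonetic code, the one with the higher weight is ranked first.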
