jprante / elasticsearch-plugin-bundle

Hello,

I have now tried to complete the examples for Kibana; see
https://gist.github.com/ThaDafinser/d27b4fa9d144b0083ee7dad37484fdd8

For the examples, I went through the complete plugin list:
https://github.com/jprante/elasticsearch-plugin-bundle#a-plugin-bundle-for-elastisearch

I couldn't find docs for the following plugins (@jprante, could you help me here, please?):

elasticsearch-analysis-autophrase
elasticsearch-analysis-concat (update: found a small example, but I don't know the options)
elasticsearch-analysis-sortform
elasticsearch-analysis-symbolname (update: found a small example, but I don't know the options)
elasticsearch-analysis-year (update: found a small example, but I don't know the options)

Other examples that are still missing or incomplete (I could not create a "live" example for all of them yet):

icu_collation (could not create an example via the _analyze API)
elasticsearch-analysis-naturalsort (one example added)
elasticsearch-analysis-reference (@todo could not create a working example with ES 5.1.2)
elasticsearch-mapper-crypt (one example added)
elasticsearch-mapper-langdetect (one example added)

Is anything else missing? Once the examples are finished, do you want them in the README or in a separate file?

For auto_phrase, this is what I have found so far (I could not get it working):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "auto_phrase",
      "phrases": [
        "C:/Data/test.txt"
      ]
    }
  ],
  "text": "what is my income tax refund this year now that my property tax is so high"
}
https://github.com/jprante/elasticsearch-plugin-bundle/blob/68dc19c34c40364e04400f92500b973a6cbae170/src/main/java/org/xbib/elasticsearch/index/analysis/autophrase/AutoPhrasingTokenFilterFactory.java

Hi,

In addition to the original issue, LemmatizeTokenFilter lacks a description too. I would appreciate any info on how to configure it, which languages it supports, and what is behind this plugin.

To me this plugin looks similar to the baseform plugin. From skimming through the code, my guess is that the lemmatizer replaces the original word, while the baseform filter adds the generated form alongside the original.

Thanks

@nkrot, you have essentially answered the question yourself.

I updated the gist with an example. As you said, it keeps only the base form and removes the original word:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "lemmatize",
      "language": "de"
    }
  ],
  "text": "Ich gehe gerne mit meinen neuen Schuhen"
}
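
For anyone who prefers a script over the Kibana console, the same request can be sketched in Python. This is only a minimal sketch under assumptions: the cluster address http://localhost:9200 is a placeholder, the plugin bundle must already be installed, and the function names build_analyze_body and analyze are made up for illustration. The live call is left commented out, and no expected tokens are shown because they depend on the plugin's dictionary.

```python
import json
import urllib.request


def build_analyze_body(text, filter_type="lemmatize", language="de"):
    """Build the same _analyze body as the console request above."""
    return {
        "tokenizer": "standard",
        "filter": [{"type": filter_type, "language": language}],
        "text": text,
    }


def analyze(body, base_url="http://localhost:9200"):
    """Send the body to the _analyze API and return the token strings.

    base_url is an assumption; point it at your own cluster.
    """
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Each entry in "tokens" carries the emitted term under "token".
    return [token["token"] for token in payload["tokens"]]


body = build_analyze_body("Ich gehe gerne mit meinen neuen Schuhen")
print(json.dumps(body, indent=2, ensure_ascii=False))
# To run against a live cluster: print(analyze(body))
```

Passing a different filter_type with the same text should make it easy to compare two filters side by side, provided that filter also accepts a language option.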

@ThaDafinser, thank you. Do you have any info on the following?

respectKeywords, available in the lemmatize plugin
lemmaOnly, available in the lemmatize plugin
where the lemmatizer resources (FSA) come from, and how they compare to baseform

Thanks,

Sadly not yet.

The tests contain a lot of examples that show how it should work, for instance:

elasticsearch-plugin-bundle/src/test/java/org/xbib/elasticsearch/index/analysis/lemmatize/LemmatizeTokenFilterTests.java

Line 113 in 93ed7cb

.put("index.analysis.filter.myfilter.lemma_only", "false")
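
The flat key in that test line corresponds to nested index settings. Here is a minimal Python sketch of that mapping. The filter name myfilter and the lemma_only key come from the test; the type lemmatize and language de are carried over from the _analyze example earlier in this thread; the helper name nest_settings is hypothetical. I have not verified the resulting body against a live index, so treat it as a starting point only.

```python
import json


def nest_settings(flat):
    """Expand dotted setting keys into the nested JSON form."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Walk down, creating intermediate objects as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat = {
    # "type" and "language" are assumptions taken from the _analyze example.
    "index.analysis.filter.myfilter.type": "lemmatize",
    "index.analysis.filter.myfilter.language": "de",
    # This key is the one used in LemmatizeTokenFilterTests.java.
    "index.analysis.filter.myfilter.lemma_only": "false",
}
settings = nest_settings(flat)
print(json.dumps({"settings": settings}, indent=2))
```

The printed object has the shape of a create-index request body, which should make it easier to try lemma_only with both values and compare the tokens.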

LemmatizeTokenFilter is still a work in progress and at an experimental stage. It is intended as an alternative to the synonym token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html), but based on a language-specific dictionary of known compound words.

After going through a lot of examples, code, and so on, I think the best approach would be to create something like this:
https://github.com/ThaDafinser/elasticsearch-plugin-bundle/blob/feature/doc/docs/index.md

There are too many things to explain for a "one pager" (or for putting everything in the README), and with this approach the documentation can be built up step by step.

As mentioned at the end of that page, the structure is similar to that of the ES reference guide: https://www.elastic.co/guide/en/elasticsearch/reference/5.3/index.html

@jprante, what do you think? If you like it, I will add some more pages and create a PR for it.

Docs: searching for example