jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins

Idea to improve decompounding

marbleman opened this issue

In my current "analysis of analyzers", it turned out that when searches fail, it is often due to wrongly decompounded words.

A lot of those words are already in the baseform dictionary.

Maybe it's a no-brainer, but did you ever consider adding dictionary functionality to the decompounder and feeding the baseform dictionary to it, so that those words are excluded from decompounding?

From my current point of view, this could improve results considerably.

Examples that must not be decompounded: loskaufen, loslassen, hochziehen, hochdrücken...

Of course it could also be helpful to decompound these words, but in many cases they get decompounded the wrong way because the decompounder does not detect the prefix as such. In the end, it is a pity when the baseform filter adds the baseform and the decompounder ruins it in the next step...
Another way could be to tag words to exclude them from further processing within the analyzer. I thought I had seen something like this somewhere, but I cannot find it anymore.

These are probably all thoughts you have already had... So what is your opinion?

The decompounder is limited to nouns only. IMHO, what you describe requires tagging words with some sort of POS analysis, such as UIMA or the Stanford POS tagger. This is something I'm interested in, but lack of time prevents improvements.

There is a keyword suppression technique: the keyword marker filter can be used to exclude words. See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
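
A minimal sketch of what such a setup could look like, written as a Python dict that serializes to index settings JSON. The `keyword_marker` type and its `keywords` parameter come from the linked Elasticsearch documentation; the filter name `protect_words`, the analyzer name `german_protected`, and the choice of the `light_german` stemmer are assumptions made for illustration.

```python
import json

# Words from this thread that later filters should leave untouched.
PROTECTED = ["loskaufen", "loslassen", "hochziehen", "hochdrücken"]


def build_settings(protected):
    """Return analysis settings with a keyword_marker placed before a stemmer."""
    return {
        "analysis": {
            "filter": {
                # Marks matching tokens as keywords (names here are hypothetical).
                "protect_words": {
                    "type": "keyword_marker",
                    "keywords": list(protected),
                },
                "german_stem": {"type": "stemmer", "language": "light_german"},
            },
            "analyzer": {
                "german_protected": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # The marker must come before the filter it should protect from.
                    "filter": ["lowercase", "protect_words", "german_stem"],
                }
            },
        }
    }


settings = build_settings(PROTECTED)
print(json.dumps({"settings": settings}, indent=2, ensure_ascii=False))
```

The printed JSON could be used as the body of an index creation request. The order of the filter chain matters: the marker only protects tokens from filters that run after it.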

Thx for the link!! I'll give it a try and let you know.

Having some sort of POS tagging could really help to distinguish between verbs, adjectives and nouns. At the very least, treating nouns with the common "-ung" ending correctly could give a boost. That seems to be problematic atm.

I am very interested in improving the mechanisms here as well, since I have to deal with technical literature that has crazy compound words all over the place. Some of them must be decompounded, some must not be.
Let me know if you see a way to bundle our efforts in improving this.

Well, the result tells me:
The keyword_marker does not seem to affect decomp though... it only affects stemmers and baseform filters.

However, in any case the keyword_marker should just affect the FIRST following filter and not ALL following filters. Otherwise words excluded from decomp also get excluded from the final stemming... which is not what I intended...

The keyword_marker filter is respected by my decompounder plugin when you set respect_keywords: true. The default is false.
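
A hedged sketch of that option in context, again as a Python dict that serializes to settings JSON. `respect_keywords` is the setting named above; the filter type `decompound`, the filter and analyzer names, and the chain order are assumptions for illustration and should be checked against the plugin README for the installed version.

```python
import json


def decompound_settings(protected, respect_keywords=True):
    """Return analysis settings where the decompounder honors keyword-marked tokens."""
    return {
        "analysis": {
            "filter": {
                "protect_words": {
                    "type": "keyword_marker",
                    "keywords": list(protected),
                },
                "decomp": {
                    # Filter type name is an assumption, not confirmed in this thread.
                    "type": "decompound",
                    # Setting named in this thread; its default is false.
                    "respect_keywords": respect_keywords,
                },
            },
            "analyzer": {
                "german_decomp": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "protect_words", "decomp"],
                }
            },
        }
    }


print(json.dumps(decompound_settings(["loskaufen", "loslassen"]), indent=2))
```

As noted earlier in the thread, a stemmer appended to this chain would also skip the marked words, because the keyword flag stays on the token for every later filter.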

@marbleman I think the better approach for you is to add an exception list to the decompounder plugin. Excluding a known list of words is what you are really doing anyway.

I second that a list of exceptions would be helpful, such that exceptions affect only the decompounder and not the stemmer, as I may want to protect a word from decompounding while still allowing it to be stemmed.

Such an exception list could be used to protect proper nouns from decompounding.
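
To make the difference between the two approaches concrete, here is a toy model in Python. It is not the plugin's code: the split table, the suffix-stripping "stemmer", and all function names are hypothetical stand-ins, and the toy decompounder keeps the original token and appends its parts. It only illustrates that a keyword flag travels with the token through every later filter, whereas an exception list is local to the decompounder.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    keyword: bool = False  # mimics the flag set by a keyword marker


# Toy split table standing in for a real decompounder (hypothetical data).
SPLITS = {"loskaufen": ["los", "kaufen"], "hochziehen": ["hoch", "ziehen"]}


def decompound(tokens, exceptions=frozenset(), respect_keywords=False):
    """Keep each token and append its parts unless the token is protected."""
    out = []
    for tok in tokens:
        out.append(tok)
        protected = tok.text in exceptions or (respect_keywords and tok.keyword)
        if not protected:
            out.extend(Token(part) for part in SPLITS.get(tok.text, []))
    return out


def stem(tokens):
    """Toy stemmer: strip a trailing 'en' unless the token carries the keyword flag."""
    out = []
    for tok in tokens:
        if not tok.keyword and tok.text.endswith("en"):
            out.append(Token(tok.text[:-2]))
        else:
            out.append(tok)
    return out


def via_keyword_flag(word):
    # The flag protects the word from the decompounder AND the stemmer.
    tokens = stem(decompound([Token(word, keyword=True)], respect_keywords=True))
    return [t.text for t in tokens]


def via_exception_list(word):
    # The exception list protects the word from the decompounder only.
    tokens = stem(decompound([Token(word)], exceptions={word}))
    return [t.text for t in tokens]


print(via_keyword_flag("loskaufen"))    # ['loskaufen']
print(via_exception_list("loskaufen"))  # ['loskauf']
```

In this sketch, the exception list gives the behavior requested above: the word is not split, but it is still stemmed.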