Idea to improve decompounding

marbleman opened this issue 8 years ago

In my current "analysis of analyzers", it turned out that failed searches are often caused by wrongly decompounded words.

Many of those words are already in the baseform dictionary.

Maybe it's a no-brainer, but did you ever consider adding dictionary functionality to the decompounder and feeding the baseform dictionary to it, so that those words are excluded from decompounding?

From my current point of view, this could improve results considerably:

Examples that must not be decompounded: loskaufen, loslassen, hochziehen, hochdrücken...

Of course it could also be helpful to decompound these words, but in many cases they get decompounded incorrectly because the decompounder does not recognize the prefix as such. In the end, it is a pity when the baseform filter adds the baseform and the decompounder ruins it in the next step...
Another way could be to tag words to exclude them from further processing within the analyzer. I thought I had seen something like this somewhere, but I cannot find it anymore.

These are probably all thoughts that you have already had... So what is your opinion?

Jan Viktor Apel commented 8 years ago

👍

Jörg Prante · Answer 1 · Tue May 10 2016 18:03:51 GMT+0800 (China Standard Time)

The decompounder is limited to nouns only. Going beyond that requires tagging words with some sort of POS analysis IMHO, such as UIMA or the Stanford POS tagger. This is something I'm interested in, but lack of time prevents improvements.

There is a keyword suppression technique: the keyword marker filter can be used to exclude words. See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
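For illustration, a minimal sketch of a keyword marker filter definition, built as a Python dict so it can be dumped to JSON for the index settings. The `keyword_marker` type and its `keywords` parameter come from the Elasticsearch documentation linked above; the filter name `protect_words` and the word list are only assumptions taken from the examples in this issue.

```python
import json

# Example words from this issue that should not be split (illustrative only).
PROTECTED = ["loskaufen", "loslassen", "hochziehen", "hochdrücken"]

def keyword_marker_filter(words):
    """Return a keyword_marker token filter definition for the given words."""
    return {
        "type": "keyword_marker",
        "keywords": list(words),
    }

# "protect_words" is a made-up filter name, not something the plugin defines.
filters = {"protect_words": keyword_marker_filter(PROTECTED)}
print(json.dumps({"analysis": {"filter": filters}}, indent=2, ensure_ascii=False))
```

For longer lists, the same filter also accepts a `keywords_path` pointing to a file with one word per line, which is probably easier to maintain than an inline list.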

marbleman · Answer 2 · Tue May 10 2016 21:25:05 GMT+0800 (China Standard Time)

Thanks for the link! I'll give it a try and let you know.

Having some sort of POS tagging could really help to distinguish between verbs, adjectives and nouns. At least treating nouns with the common "-ung" ending correctly could give a boost. That seems to be problematic at the moment.

I am very interested in improving the mechanisms here as well, since I have to deal with technical literature that has crazy compound words all over the place. Some of them must be decompounded, some must not be.
Let me know if you see a way to combine our efforts in improving this.

marbleman · Answer 3 · Thu May 12 2016 01:58:48 GMT+0800 (China Standard Time)

Well, the result tells me:
The keyword_marker does not seem to affect decomp though... it only affects stemmers and baseform filters.

However, in any case the keyword_marker should just affect the FIRST following filter and not ALL following filters. Otherwise words excluded from decomp also get excluded from the final stemming... which is not what I intended...

Jörg Prante · Answer 4 · Wed Jul 13 2016 21:31:57 GMT+0800 (China Standard Time)

The keyword_marker filter is respected by my decompounder plugin when respect_keywords: true is set. The default is false.
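A rough sketch of how the two filters might be chained, again as a Python dict that can be serialized to JSON. Only `respect_keywords` is confirmed in this thread; the filter type name `decompound`, the names `protect_words`, `decomp` and `german_decomp`, and the tokenizer choice are assumptions for illustration, so check them against the plugin README before use.

```python
import json

def decompound_settings(protected_words):
    """Build index settings where keyword_marker runs before the decompounder."""
    return {
        "analysis": {
            "filter": {
                # Marks the listed tokens as keywords.
                "protect_words": {
                    "type": "keyword_marker",
                    "keywords": list(protected_words),
                },
                # Assumed type name; respect_keywords is off by default.
                "decomp": {
                    "type": "decompound",
                    "respect_keywords": True,
                },
            },
            "analyzer": {
                "german_decomp": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Order matters: tokens must be marked before decompounding.
                    "filter": ["lowercase", "protect_words", "decomp"],
                }
            },
        }
    }

settings = decompound_settings(["loslassen", "hochziehen"])
print(json.dumps(settings, indent=2, ensure_ascii=False))
```

Note that the keyword flag stays on the token, so any stemmer placed after `decomp` in the chain would skip the protected words as well.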

Jayson Minard · Answer 5 · Mon Mar 13 2017 07:20:38 GMT+0800 (China Standard Time)

@marbleman I think the better approach for you is to add an exception list to the decompounder plugin. I think that is what you are really doing anyway: excluding a known list of words.

Nikolai Krot · Answer 6 · Mon Apr 10 2017 18:20:49 GMT+0800 (China Standard Time)

I second that a list of exceptions would be helpful, provided the exceptions affect only the decompounder and not the stemmer: I may want to protect a word from decompounding while still allowing it to be stemmed.

Such an exception list could be used to protect proper nouns from decompounding.
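To make the intended behavior concrete, here is a toy sketch in Python. It is not the plugin's algorithm: `toy_decompound` is a naive two-part dictionary split and `toy_stem` just strips a trailing "en", both invented for illustration. The point is only that the exception list is consulted by the decompounding step alone, so protected words still reach the stemmer.

```python
def toy_decompound(word, vocabulary):
    """Split a word into two known parts if possible (naive stand-in)."""
    for i in range(3, len(word) - 2):
        head, tail = word[:i], word[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [word]

def toy_stem(token):
    """Strip a trailing 'en' from longer tokens (stand-in for a real stemmer)."""
    if token.endswith("en") and len(token) > 4:
        return token[:-2]
    return token

def analyze(word, vocabulary, exceptions):
    """Decompound unless the word is an exception, then stem every part."""
    parts = [word] if word in exceptions else toy_decompound(word, vocabulary)
    return [toy_stem(part) for part in parts]

VOCAB = {"los", "lassen", "kaufen"}
EXCEPTIONS = {"loslassen"}

print(analyze("loslassen", VOCAB, EXCEPTIONS))  # ['loslass']: kept whole, still stemmed
print(analyze("loskaufen", VOCAB, EXCEPTIONS))  # ['los', 'kauf']: split, then stemmed
```

This differs from the keyword marker approach, where the flag travels with the token and later filters skip it too.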