jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins

Settings definition differs between original langdetect and bundle

marbleman opened this issue · comments

Had a hard time today figuring out why my application slowed down by a factor of around 20. After a lot of profiling I found langdetect to be the issue. Finally I compared the original langdetect plugin with the plugin-bundle and wrote a unit test to measure execution time.

The reason is quite simple: the original langdetect plugin expects settings as
langdetect.languages = en,de,fr
while the plugin-bundle wants to see
languages = en,de,fr
in elasticsearch.yml

This applies to all settings (compare src\main\java\org\xbib\elasticsearch\module\langdetect\LangdetectService.java for details)
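
To illustrate why a mismatched key can cost so much time, here is a minimal sketch. It is not the plugin's code: `load_languages`, `ALL_LANGUAGES` and the fallback behaviour are assumptions made for illustration. The idea is that a setting written under the wrong prefix is simply not found, so the service presumably falls back to loading every language profile.

```python
# Hypothetical sketch, not taken from LangdetectService.java.
# ALL_LANGUAGES is a short stand-in for the full profile list.
ALL_LANGUAGES = ["ar", "de", "en", "es", "fr", "it", "ja", "ru", "zh"]


def load_languages(settings, key):
    """Return the configured language list, or every language if the key is absent."""
    raw = settings.get(key)
    if raw is None:
        # A key under the wrong prefix is silently ignored here,
        # so detection would run against the full (slow) set.
        return list(ALL_LANGUAGES)
    return [code.strip() for code in raw.split(",") if code.strip()]


# What a user following the standalone plugin's docs would have configured.
yml_settings = {"langdetect.languages": "en,de,fr"}

as_standalone = load_languages(yml_settings, "langdetect.languages")
as_bundle = load_languages(yml_settings, "languages")

print(as_standalone)  # the three configured languages
print(as_bundle)      # the full stand-in list, because the key was not found
```

If the bundle behaves like this sketch, no error is raised for the unknown key, which would explain why the slowdown only showed up in profiling.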

Is this intended? If yes, I will push an update to the docs...

BTW: I also tried the parameter ?profile=/langdetect/short-text/ since it appeared to me it could speed up detection (probably at cost of accuracy). But in all my tries I always got "profile": "/langdetect/" returned.

You're working so hard to find the differences between those two incarnations of the plugin... this helps a lot in aligning them!

Surely differences were never intended, codebases should be the same. The reason why they diverge was focusing on the "bundle" for a more comprehensive installation in my production environment, leaving the "langdetect-plugin" a bit behind. I got some internal feedback for the "bundle" that never made it back to the other version. Sorry for the mess.

BTW there are also some junit tests missing in the "bundle" which are present in "langdetect-plugin".

Would have saved a lot of time if I had the idea to compare the two code folders earlier... ;)

Which codebase is intended to be the Master? I guess the single plugins since they carry the most detailed documentation, right?

I am not a Java developer by nature, so it will take me a lot of effort and time to set up a functional development environment for all this stuff. I promise I will get to it at some point ;) Maybe you have a good tip for a starting point/howto. The unit test mentioned above was written against the PHP implementation, though.

So for now all I can offer is to help with the docs and testing. Is there a way to get notifications on changes, similar to code reviews? This would help to check immediately when implementation and documentation go out of sync. I would rather invest the time here, where everyone benefits, than spend hours reverse engineering issues like the one above... ;)

I see you are investing a lot of your time into langdetect right now, so I will align both codebases in the next few hours, in the hope that I can clear up the mess a bit. Both contain parts that belong in the current state.

Watching a github project should give you notifications about commits, but I'm not sure :(

I'll give it a try. Let me know if I can be of any help.

BTW: I figured out that reducing the languages to test as described above will lead to wrong results instead of no result or at least a low probability:

E.g. I limit detection to de,en and send in a French text. The result gives me "en" with a probability of 0.99!

First commit is here in my alignment quest.

jprante/elasticsearch-langdetect@ba72272

Plugin bundle will follow.

So here is the second commit to align both langdetect

48b27ba

Just came across something that confuses me: Thought you had mentioned you wanted to go for ISO-639-1 codes in langdetect (de, en... instead of ger,eng) ?

Current bundle 2.2.0.3 returns ger, eng...

Oh, and I stumbled over some details regarding limiting the detected languages in yml that could use some extra documentation, but the intention is still a bit unclear to me: I limited detection to de, en because detecting all languages takes too much time. Now I send in a Russian text and get a probability of 0.999xxx for either de or en. I would expect a much lower probability or even an empty result instead. Am I wrong?
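
One plausible explanation, sketched below, is that the detector normalizes its scores over only the loaded language profiles, which is typical for naive-Bayes style detectors. This has not been verified against the plugin's code, and the log-likelihood numbers are invented purely for illustration. Under that assumption, the reported probability is relative to the candidates, not an absolute confidence.

```python
import math


def normalize(log_scores):
    """Turn per-language log-likelihoods into probabilities that sum to 1."""
    peak = max(log_scores.values())
    # Subtract the peak before exponentiating to avoid numeric underflow.
    weights = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Invented log-likelihoods for a Russian text (illustrative only).
log_scores = {"ru": -40.0, "en": -95.0, "de": -102.0}

# With all three profiles loaded, "ru" takes almost all of the probability mass.
all_langs = normalize(log_scores)

# With detection limited to de,en, the mass is redistributed between those two,
# so the least bad candidate ends up with a probability close to 1.
restricted = normalize({lang: log_scores[lang] for lang in ("de", "en")})

print(all_langs)
print(restricted)
```

If this is what happens, an empty or low-probability result would need an extra check, for example a threshold on the unnormalized score, rather than on the normalized probability.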

I drilled down on the ISO issue and figured out that the repo already contains a language.json with de/en... while elasticsearch-plugin-bundle-2.2.0.3-plugin.zip still contains the old one with ger/eng...
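
A quick way to tell which variant a given build ships is to look at the length of the codes it returns. The helper below is a hypothetical sketch: `code_style` is not part of the plugin, and extracting the codes from language.json or from a detection response is left out because that layout is not shown here.

```python
def code_style(codes):
    """Classify language codes as two-letter, three-letter, or mixed."""
    # Look only at the primary subtag so that codes like "zh-cn" count as "zh".
    lengths = {len(code.split("-")[0]) for code in codes}
    if lengths == {2}:
        return "two-letter"    # e.g. de, en (ISO 639-1 style)
    if lengths == {3}:
        return "three-letter"  # e.g. ger, eng
    return "mixed"


print(code_style(["de", "en", "zh-cn"]))
print(code_style(["ger", "eng"]))
```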