jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins

"langdetect" mapping issue language code not retrievable

antonsar opened this issue

Hello,

I am trying this plugin out to handle documents with mixed languages. Unfortunately, the type "langdetect" is causing an issue for me.

Here is some info that may be useful:
ES version 5.1.1
This bundle plugin version 5.1.1.0
smart_cn analysis plugin - latest
kuromoji analysis plugin - latest

Then I did this (following the example):

curl -XDELETE 'localhost:9200/test'
curl -XPUT 'localhost:9200/test'
curl -XPOST 'localhost:9200/test/article/_mapping' -d '
{
"article" : {
"properties" : {
"content" : { "type" : "langdetect" }
}
}
}
'
curl -XPUT 'localhost:9200/test/article/1' -d '
{
"title" : "Some title",
"content" : "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
}
'

Finally, I ran the search after calling refresh:

curl -XPOST 'localhost:9200/test/_search' -d '
{
"query" : {
"term" : {
"content" : "en"
}
}
}
'
However, the search above returns 0 hits.

I double-checked the mapping, and "content" now shows up like this:

curl -XGET "localhost:9200/test/_mappings?pretty"

"content" : {
"type" : "langdetect",
"analyzer" : "_keyword",
"include_in_all" : false
}

Calling curl -XGET 'localhost:9200/test/_search' shows this:

"_source" : {
"content" : "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
}

Based on the examples and the result I am getting, I don't think this is the intended behavior. How should I retrieve the detected language code?

Thank You!

Hi JPrante,

Please let me know if you need further details. I would appreciate it if you could take a look at this issue.

Essentially, what I did was follow the langdetect example from the README, and it did not return the expected result. The only thing that is different in my environment is that I have 2 additional plugins installed (smart_cn and kuromoji).

Thanks in advance!

For langdetect in 5.1.1.0, you have to explicitly declare all languages you want to be detected, like this:

PUT /test
{
   "mappings": {
      "article": {
         "properties": {
            "content": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
      }
   }
}
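
As a rough sketch, the same mapping could be applied with curl in the style of the original report. The function names (mapping_body, query_body, run_langdetect_check) and the ES_HOST variable are made up for illustration, the explicit _refresh call is an assumption about how the refresh was done, and whether the term query then matches depends on the preview release behaving as the README describes:

```shell
#!/bin/sh
# Assumed host; override ES_HOST to point at your own cluster.
ES_HOST="${ES_HOST:-localhost:9200}"

# Mapping from the reply above: "languages" lists every code to be detected.
mapping_body() {
cat <<'EOF'
{
  "mappings": {
    "article": {
      "properties": {
        "content": {
          "type": "langdetect",
          "languages": [ "en", "de", "fr" ]
        }
      }
    }
  }
}
EOF
}

# Term query on the detected language code, as in the original report.
query_body() {
cat <<'EOF'
{
  "query": {
    "term": { "content": "en" }
  }
}
EOF
}

# Hypothetical helper: recreate the index with the explicit language list,
# index the sample document, refresh, then search by language code.
run_langdetect_check() {
  curl -XDELETE "$ES_HOST/test"
  mapping_body | curl -XPUT "$ES_HOST/test" -d @-
  curl -XPUT "$ES_HOST/test/article/1" -d '
  {
    "title" : "Some title",
    "content" : "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
  }'
  curl -XPOST "$ES_HOST/test/_refresh"
  query_body | curl -XPOST "$ES_HOST/test/_search?pretty" -d @-
}
```

Nothing is sent until run_langdetect_check is called, so the request bodies can be inspected first with mapping_body or query_body. Note that the index is deleted and recreated, because the mapping of an existing field cannot simply be swapped in place.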

Bundle 5.1.1.0 is a preview release, not official.

ICU and hyphen are working and documented; all other analyzers are not reviewed and not well documented. The docs and examples are out of sync. Bugs and changes are to be expected. They will be fixed and documented in future versions.

Thank you so much for the clarification!!!

Thanks again