wikimedia / search-highlighter

Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for contributing)

Weird highlighting behaviour

mdomans opened this issue · comments

tl;dr: For me, the point of using the experimental highlighter is to use less storage space while still enabling multiple types of analysis on fields.

Longer:

I have a set of fields. Let's call one of them text. Because I want to reap the benefits of multiple analyzers, the field text has a subfield, text.raw. On text.raw I don't lowercase or stem. So I can have very broad queries, like dog matching Doggy, or very precise ones, where Dogs only matches dogs. And so on. The caveat is that I need to highlight results.
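To make the setup concrete, here is a minimal sketch of what such a mapping might look like, written as a Python dict and printed as JSON. The field type and both analyzer names are assumptions for illustration, not taken from the real index: `english` stands in for an analyzer that lowercases and stems, and `whitespace` for one that does neither.

```python
import json

# Hypothetical mapping: the type and analyzer names are assumptions,
# not copied from the production index.
MAPPING = {
    "properties": {
        "text": {
            "type": "string",       # would be "text" on newer Elasticsearch versions
            "analyzer": "english",  # lowercases and stems (broad matching)
            "fields": {
                "raw": {
                    "type": "string",
                    "analyzer": "whitespace",  # no lowercasing, no stemming (precise matching)
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(MAPPING, indent=2))
```

Neither field sets the stored option here, which matches the goal of keeping the index small.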

Running this in production, where I store every subfield, makes for an insanely big index (north of 500GB). Same with FVH using matched_fields. So I really need the experimental highlighter's matched_fields in this case.

Here's an example of the highlight part of the request:

"text": {
    "number_of_fragments": 0,
    "matched_fields": [
        "text",
        "text.raw"
    ],
    "type": "experimental"
}

As you can see, I'm trying to tie the parent field and its subfield into a single field for the purpose of highlighting.

Now: depending on how the query part of the request is formed, I either get highlight results or not. Here's the isolated difference in query structure that I see:

This query will work:

"query_string": {
    "use_dis_max": true,
    "query": "\"Eskimos\"",
    "fields": [
        "description",
        "text.raw"
    ]
}

And this won't:

"query_string": {
    "use_dis_max": true,
    "query": "\"Eskimos\"",
    "fields": [
        "text.raw"
    ]
}

And this won't either:

"query_string": {
    "use_dis_max": true,
    "query": "\"Eskimos\"",
    "fields": [
        "description.raw",
        "text.raw"
    ]
}

Now, the single hit in this testing index is returned for every query. But only in the first case does it also include the highlighting dict. Neither of the raw fields is stored. The interesting part is that explain shows no need for the description field: it doesn't contain the word Eskimos. Also, the description field can be replaced by any field that has the stored option on.
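For anyone who wants to reproduce this, the three variants differ only in their fields list, so they can be generated from one template. The sketch below is only an illustration: `build_body` and `has_highlight` are hypothetical helper names, and wrapping the highlight fragment in `"highlight": {"fields": ...}` is assumed from the usual Elasticsearch search body layout rather than copied from the actual request.

```python
import json

# Highlight clause from the request above, wrapped in the usual
# "highlight" -> "fields" envelope (assumed, not shown in the report).
HIGHLIGHT = {
    "fields": {
        "text": {
            "number_of_fragments": 0,
            "matched_fields": ["text", "text.raw"],
            "type": "experimental",
        }
    }
}

# The three fields lists from the queries above. In the report, only the
# first variant came back with a highlight section.
VARIANTS = {
    "description + text.raw": ["description", "text.raw"],
    "text.raw only": ["text.raw"],
    "description.raw + text.raw": ["description.raw", "text.raw"],
}


def build_body(fields):
    """Return a full search body for one variant (hypothetical helper)."""
    return {
        "query": {
            "query_string": {
                "use_dis_max": True,
                "query": "\"Eskimos\"",
                "fields": list(fields),
            }
        },
        "highlight": HIGHLIGHT,
    }


def has_highlight(hit):
    """True if a search hit carries a non-empty highlight section."""
    return bool(hit.get("highlight"))


if __name__ == "__main__":
    for name, fields in VARIANTS.items():
        print("#", name)
        print(json.dumps(build_body(fields), indent=2))
```

Each printed body can be sent to the _search endpoint of the test index, and `has_highlight` applied to every returned hit shows which variants lose the highlighting.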

My question: what's going on? Is there some kind of optimization in place here that needs to be force-disabled, or is this a cryptic bug?

I think we have a similar use case; see how the highlight query is built on Wikipedia: https://en.wikipedia.org/w/index.php?search=~tests&title=Special:Search&go=Go&cirrusDumpQuery

Would it be possible to share a simple test case (from index creation and mapping to query, with expected and actual results)? It would help us understand what you are trying to achieve.

Thanks.