healthonnet / hon-lucene-synonyms

Solr query parser plugin that performs proper query-time synonym expansion.

Home Page: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr

Construct Phrases may result in double quotes when original search is quoted

rpialum opened this issue

There are a few issues with the current constructPhrases logic when expanding synonyms. One of them can result in doubled quotes when the original search is already quoted.

Example search: "Internal Revenue Service" takes money
Synonyms: (IRS, tax service, internal revenue service)

Current results: "IRS" takes money; ""tax service"" takes money; ""internal revenue service"" takes money.

My proposed solution involves a few fixes within generateSynonymQueries(), all of which apply only when SynonymDismaxParams.SYNONYMS_CONSTRUCT_PHRASES is set to true.

  1. Only apply quotes when the synonym term is a phrase (more than one term).
  2. Only apply quotes when the synonym phrase is not already surrounded by quotes.

Changes:

Add to the top of generateSynonymQueries():
String origQuery = getQueryStringFromParser();
int queryLen = origQuery.length();

// TODO: make the token stream reusable?
TokenStream tokenStream = synonymAnalyzer.tokenStream(SynonymDismaxConst.IMPOSSIBLE_FIELD_NAME,
        new StringReader(origQuery));

Replace the current phraseQuery if logic with:
if (constructPhraseQueries && typeAttribute.type().equals("SYNONYM") &&
        termToAdd.contains(" "))
{
    // Don't quote when the original is already surrounded by quotes.
    // The first two checks also guard the charAt() calls against out-of-range offsets.
    if (offsetAttribute.startOffset() == 0 ||
            offsetAttribute.endOffset() == queryLen ||
            origQuery.charAt(offsetAttribute.startOffset() - 1) != '"' ||
            origQuery.charAt(offsetAttribute.endOffset()) != '"')
    {
        // make a phrase out of the synonym
        termToAdd = new StringBuilder(termToAdd).insert(0, '"').append('"').toString();
    }
}
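
To make the two rules easier to check outside of Solr, here is a minimal, self-contained sketch of the same decision applied to the example query above. It is not the plugin's code: the class PhraseQuoting and its methods are hypothetical names, and the offsets are passed in as plain ints rather than read from Lucene's OffsetAttribute.

```java
// Hypothetical stand-alone illustration of the proposed quoting rules.
class PhraseQuoting {

    /** True when the span [start, end) of origQuery is already wrapped in double quotes. */
    static boolean alreadyQuoted(String origQuery, int start, int end) {
        return start > 0
                && end < origQuery.length()
                && origQuery.charAt(start - 1) == '"'
                && origQuery.charAt(end) == '"';
    }

    /** Rule 1: quote only multi-term synonyms. Rule 2: skip spans that are already quoted. */
    static String maybeQuote(String synonym, String origQuery, int start, int end) {
        boolean isPhrase = synonym.contains(" ");
        if (isPhrase && !alreadyQuoted(origQuery, start, end)) {
            return "\"" + synonym + "\"";
        }
        return synonym;
    }

    /** Substitutes the synonym for the span [start, end) of the original query. */
    static String expand(String origQuery, int start, int end, String synonym) {
        return origQuery.substring(0, start)
                + maybeQuote(synonym, origQuery, start, end)
                + origQuery.substring(end);
    }

    public static void main(String[] args) {
        String phrase = "Internal Revenue Service";
        String[] queries = {
            "\"Internal Revenue Service\" takes money", // quoted in the original search
            "Internal Revenue Service takes money"      // not quoted
        };
        for (String query : queries) {
            int start = query.indexOf(phrase);
            int end = start + phrase.length();
            for (String synonym : new String[] {"IRS", "tax service"}) {
                System.out.println(expand(query, start, end, synonym));
            }
        }
    }
}
```

With the quoted search, the existing quotes are reused and no second pair is added; with the unquoted search, only the multi-term synonym is wrapped in quotes.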

Thanks for raising the issue. I agree it's a problem and will look into applying the patch. If you'd like the process to go faster, though, please submit a formal PR and also a unit test showing that your fix works. The unit tests are all done in Python and should be fairly easy to understand; there are instructions in the readme.

Thanks for the fast response and for all the hard work you've done on this project. I assume PR refers to Pull Request (I've only ever used GitHub for pulling code, not for contributing). I'm also a novice when it comes to Solr internals and configuration (Filters/Tokenizers), though reading through your and Tien's multi-term synonym logic this past week has given me a bit of a crash course.

One quick question: when using the synonym_edismax parser, how do its Tokenizers and Filters interact with the query-time Tokenizer and Filters specified on the field being queried? Are they applied in a specific order, does one supersede the other, or are they completely independent of each other?

Our schema file specifies the following for the field we're expanding synonyms on:

<fieldType name="our_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"  />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

@rpialum No problem, and yes PR is a pull request. :)

I'm not sure how to answer your question, but if you use the debug UI in the Solr admin, you should be able to see how the filter factories and tokenizers are applied to the input, one after another.
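
For anyone who prefers to inspect the chain without the UI, the same information can usually be requested over HTTP. The sketch below only builds such a request URL. It assumes Solr's field analysis request handler is enabled at /analysis/field; the host, port, and core name are placeholders, and the class AnalysisUrl is a hypothetical helper, not part of this plugin.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a field-analysis request URL for a given field type.
class AnalysisUrl {

    /**
     * coreUrl   base URL of the Solr core (placeholder value in main below)
     * fieldType name of the fieldType in schema.xml, e.g. "our_text"
     * queryText text to run through the field type's query-time analyzer
     */
    static String build(String coreUrl, String fieldType, String queryText) {
        return coreUrl + "/analysis/field"
                + "?analysis.fieldtype=" + URLEncoder.encode(fieldType, StandardCharsets.UTF_8)
                + "&analysis.query=" + URLEncoder.encode(queryText, StandardCharsets.UTF_8)
                + "&wt=json";
    }

    public static void main(String[] args) {
        // Assumed local install and core name; adjust to your own setup.
        String coreUrl = "http://localhost:8983/solr/collection1";
        System.out.println(build(coreUrl, "our_text", "Internal Revenue Service"));
    }
}
```

Opening the resulting URL in a browser or with curl should list the output of the tokenizer and each filter for the given text, in the order they run.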

Also, as for your schema, you can add it to the sample schema, which is also what the unit tests use. That way, you should be able to get the unit tests running against it. Ping me if anything else is unclear; hope that helps!