Construct Phrases may result in double quotes when original search is quoted

Question

rpialum opened this issue 10 years ago · comments

There are a few issues with the current constructPhrases logic when expanding synonyms. One of them can result in multiple quotes being applied when the original search is quoted.

e.g. Search: "Internal Revenue Service" takes money
Synonyms: (IRS, tax service, internal revenue service)

Current results: "IRS" takes money; ""tax service"" takes money; ""internal revenue service"" takes money.

My proposed solution involves making a few fixes within generateSynonymQueries(), all of which apply only when SynonymDismaxParams.SYNONYMS_CONSTRUCT_PHRASES has been set to true.

Only apply quotes when the synonym term is a phrase (more than one term).
Only apply quotes when the synonym phrase is not already surrounded by quotes.

Changes:

Add to top of generateSynonymQueries():
String origQuery = getQueryStringFromParser();
int queryLen = origQuery.length();

// TODO: make the token stream reusable?
TokenStream tokenStream = synonymAnalyzer.tokenStream(SynonymDismaxConst.IMPOSSIBLE_FIELD_NAME,
        new StringReader(origQuery));

Replace current phraseQuery if logic with:
if (constructPhraseQueries && typeAttribute.type().equals("SYNONYM") &&
        termToAdd.contains(" "))
{
    // Don't quote when the original is already surrounded by quotes
    if (offsetAttribute.startOffset() == 0 ||
            offsetAttribute.endOffset() == queryLen ||
            origQuery.charAt(offsetAttribute.startOffset() - 1) != '"' ||
            origQuery.charAt(offsetAttribute.endOffset()) != '"')
    {
        // make a phrase out of the synonym
        termToAdd = new StringBuilder(termToAdd).insert(0, '"').append('"').toString();
    }
}
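
To make the intent of the two rules easier to check in isolation, here is a minimal standalone sketch of the same decision without any Lucene types. The class and method names (SynonymQuoting, isAlreadyQuoted, quoteIfNeeded) are hypothetical and not part of the plugin; the start/end offsets stand in for what offsetAttribute.startOffset() and offsetAttribute.endOffset() would report.

```java
public class SynonymQuoting {

    /**
     * Hypothetical helper: true when the span [start, end) of the query
     * has a double quote immediately before and immediately after it.
     */
    static boolean isAlreadyQuoted(String query, int start, int end) {
        return start > 0
                && end < query.length()
                && query.charAt(start - 1) == '"'
                && query.charAt(end) == '"';
    }

    /**
     * Hypothetical helper: wraps the synonym in quotes only when it is a
     * multi-word phrase and the span it replaces is not already quoted.
     */
    static String quoteIfNeeded(String query, int start, int end, String synonym) {
        boolean isPhrase = synonym.contains(" ");
        if (isPhrase && !isAlreadyQuoted(query, start, end)) {
            return "\"" + synonym + "\"";
        }
        return synonym;
    }

    public static void main(String[] args) {
        String quoted = "\"Internal Revenue Service\" takes money";
        String bare = "Internal Revenue Service takes money";

        // Span 1..25 is the phrase inside the existing quotes: no extra quotes added.
        System.out.println(quoteIfNeeded(quoted, 1, 25, "tax service"));
        // Span 0..24 is unquoted in the original: the phrase synonym gets quoted.
        System.out.println(quoteIfNeeded(bare, 0, 24, "tax service"));
        // Single-term synonyms are never quoted.
        System.out.println(quoteIfNeeded(bare, 0, 24, "IRS"));
    }
}
```

The isAlreadyQuoted check is the negation of the four-part condition in the patch above, so the two should agree for any valid offsets.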

Nolan Lawson · Answer 1 · Tue Jun 17 2014 01:06:56 GMT+0800 (China Standard Time)

Thanks for raising the issue. I agree it's a problem and will look into applying the patch. If you'd like the process to go faster, though, please submit a formal PR and also a unit test showing that your fix works. The unit tests are all done in Python and should be fairly easy to understand; there are instructions in the readme.

Jeremy · Answer 2 · Tue Jun 17 2014 22:31:25 GMT+0800 (China Standard Time)

Thanks for the fast response and for all the hard work you've done on this project. I assume PR refers to Pull Request (I've only ever used GitHub for pulling code, not for contributing). I'm also a novice when it comes to Solr internals and configuration (Filters/Tokenizers), though looking through your and Tien's multi-term synonym logic this past week has given me a bit of a crash course.

One quick question: when using the synonym_edismax parser, how do the parser's Tokenizers and Filters interact with the query-time Tokenizer and Filters specified on the field being queried? Is there a specific order in which they are applied, does one supersede the other, or are they completely independent of each other?

Our schema file specifies the following for the field we're expanding synonyms on:

<fieldType name="our_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"  />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Nolan Lawson · Answer 3 · Wed Jun 18 2014 00:01:35 GMT+0800 (China Standard Time)

@rpialum No problem, and yes PR is a pull request. :)

I'm not sure how to answer your question, but if you use the debug UI in the solr admin, you should be able to see how the filter factories and tokenizers are being successively applied to the input.

Also, as for your schema, you can add it to the sample schema, which is what the unit tests use, so you should be able to get the unit tests running against it. I.e. these files are what's used in the unit tests. Ping me if anything else is unclear; hope that helps!

Nolan Lawson · Answer 4 · Sun Oct 05 2014 02:53:20 GMT+0800 (China Standard Time)

fixed in 242d330