olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

Home Page:http://lunrjs.com

Searching for things like "--hard" or "--help" breaks the search/returns no results

MikeArsenault opened this issue · comments

There seems to be a problem regarding escaping multiple characters, in that the search does not seem to understand back to back escaped characters. For example, we know there are 6 results for --help in the handbook.

As expected, searching --help leads you to the infinite load issue, and the following console error (search):

https://d.pr/i/vDlyYe

  • Searching --help returns no results.
  • Searching --help returns instances of -help such as slack channels with -help in the name.
  • Searching ---help returns no results.
  • Searching using html entities returns that you have not searched for anything. These include −, − and −.
  • Searching with URL encoding dashes.

We are wondering if this is by design and we just haven't determined the right escape format? Our version of lunr is 2.3.7.

According to the docs, a + or - in a search string determines the presence or absence of a term.

So idx.search('+') or idx.search('++any_word') will throw an error such as "expecting term or field, found nothing", because each + or - must be followed by a term.
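
A minimal sketch of guarding against that error in application code, assuming the thrown error has the name QueryParseError (as in the console messages quoted in this thread). safeSearch is a hypothetical helper, not part of lunr, and idx is any object with a search method.

```javascript
// Hypothetical wrapper, not part of lunr: returns [] instead of throwing
// when the query string cannot be parsed (for example a bare "+" or "-").
function safeSearch(idx, text) {
  try {
    return idx.search(text);
  } catch (err) {
    // Assumption: parse errors carry the name "QueryParseError".
    if (err && err.name === "QueryParseError") {
      return [];
    }
    throw err; // anything else is unexpected, so let it surface
  }
}
```

With this in place a UI can show "no results" for an unparsable query instead of failing on the exception.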

Do you have an example of the search string you are using? You mention that back to back escapes do not work, can you provide an example of how you are escaping back to back characters?

A backslash is used to escape characters that would otherwise have meaning in a query, so, for example, I would expect \-\-help to work.
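
As a rough illustration of that escaping, the hypothetical helper below (not part of lunr) prefixes each character that appears to be special in a query string with a backslash. The character list (\, +, -, :, ^, ~) is an assumption, so adjust it to your version. Escaping only gets the characters past the query parser; whether anything matches still depends on how the documents were tokenized when the index was built.

```javascript
// Hypothetical helper, not part of lunr.
// Assumption: these are the characters with special meaning in a query string.
function escapeLunrQuery(input) {
  // "$&" in the replacement stands for the matched character itself.
  return input.replace(/[\\+\-:^~]/g, "\\$&");
}

const escaped = escapeLunrQuery("--help");
console.log(escaped); // prints \-\-help
```

The result can be passed straight to idx.search without going through another string literal, so the backslashes are preserved.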

If you can set up a minimal reproduction demonstrating the issue in something like jsfiddle (or similar), that'd be a great help.

I'm experiencing issues with escaping as well.
I have an example from the demo.
Search for flight\-\-a and it won't find anything, although the string flight--a exists in article number 2.

Hello, I'm looking into the same issue while trying to escape a +. Escaping with \, as mentioned in the docs, does not seem to work. I think @gilisho's example demonstrates the issue well.

Instead of using Index.search I'm now trying to use Index.query.
Using the index from @gilisho's example site directly, I tried the following:

idx.search("flight")
# [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, ] (12)

idx.search("flight--a")
# [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, ] (12)

idx.search("flight\-\-a")
# [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, ] (12)

I think that's because the - and \ are removed by the tokenizer:

lunr.tokenizer("flight--a")
# Array (2) = $7
# 0 {str: "flight", metadata: {position: [0, 6], index: 0}, toString: function, update: function, clone: function}
# 1 {str: "a", metadata: {position: [8, 1], index: 1}, toString: function, update: function, clone: function}

lunr.tokenizer("flight\-\-a")
# Array (2) = $7
# 0 {str: "flight", metadata: {position: [0, 6], index: 0}, toString: function, update: function, clone: function}
# 1 {str: "a", metadata: {position: [8, 1], index: 1}, toString: function, update: function, clone: function}
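
One thing worth checking in the calls above: in an ordinary JavaScript string literal, \- is just an escaped hyphen, so "flight\-\-a" and "flight--a" are the same string and the backslashes never reach lunr. That would be consistent with the identical position [8, 1] in both tokenizer outputs. The sketch below uses plain JavaScript only; the split call is a simplified stand-in for the tokenizer, assuming its default separator is /[\s\-]+/.

```javascript
const plain = "flight\-\-a";          // backslashes are consumed by the JS parser
const raw = String.raw`flight\-\-a`;  // backslashes are preserved

console.log(plain === "flight--a");    // true
console.log(plain.length, raw.length); // 9 11

// Stand-in for the tokenizer: split on runs of whitespace and hyphens.
const separator = /[\s\-]+/;
const plainParts = plain.split(separator); // two pieces: "flight" and "a"
const rawParts = raw.split(separator);     // three pieces, backslashes left in place
console.log(plainParts.length, rawParts.length); // 2 3
```

So to pass a literal backslash from source code, either double it ("flight\\-\\-a") or use String.raw.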

Using the Index.query API:

idx.query(q => q.term(lunr.tokenizer("flight--a")))
# [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, ] (12)

idx.query(q => q.term(lunr.tokenizer("flight\-\-a")))
# [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, ] (12)

That was expected because the tokenizer removed the part we were interested in.

But, with the snippet below, I expected I would get back some results:

idx.query(q => q.term("flight--a"))
# []

To verify that - has no special meaning in the Index.query API, I ran:

idx.search("-")
# QueryParseError: expecting term or field, found nothing

idx.search("--")
# QueryParseError: expecting term or field, found 'PRESENCE'

idx.query(q => q.term(lunr.tokenizer("-")))
# [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, ] (12)

idx.query(q => q.term(lunr.tokenizer("--")))
# [Object, Object, Object, Object, Object, Object, Object, Object, Object, Object, ] (12)

Any hints on this @olivernn ?

Essentially this is caused by the same issue as #481 and #245: you need to remove the trimmer from the pipeline, customize the tokenizer, or both.

That is,

  • when you build the index, you must not remove the -- from the tokens.
  • when you parse the search query, you also must not remove the -- from the tokens.

By using query(q => q.term(…)), you achieved the second point. To achieve the first point you need to modify the indexer.

var index = lunr(function () {
	this.pipeline.reset(); // NOTE 1: reset the pipeline, which removes the trimmer
	this.ref("ref");
	this.field("title");
	// NOTE 2: put the field value in an array so the tokenizer doesn't try to
	// split it; each array element becomes one token
	this.add({ref: "a", title: ["--"]});
});

Then:

index.query(q=>q.term("--"))
index.search(String.raw `\-\-`)
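
To illustrate why both the separator and the trimmer matter, here is a small stand-alone sketch in plain JavaScript. It does not call lunr; tokenize and trim are simplified stand-ins, assuming the default separator is /[\s\-]+/ and that the trimmer strips non-word characters from both ends of a token.

```javascript
// Simplified stand-in for the tokenizer: split and drop empty pieces.
function tokenize(text, separator) {
  return text.split(separator).filter(piece => piece.length > 0);
}

// Simplified stand-in for the trimmer: strip leading and trailing non-word characters.
function trim(token) {
  return token.replace(/^\W+/, "").replace(/\W+$/, "");
}

const text = "--help flight--a";

const defaultTokens = tokenize(text, /[\s\-]+/); // hyphens act as separators
const whitespaceTokens = tokenize(text, /\s+/);  // hyphens stay inside tokens
const trimmedTokens = whitespaceTokens.map(trim); // leading "--" is stripped again

console.log(defaultTokens);    // [ 'help', 'flight', 'a' ]
console.log(whitespaceTokens); // [ '--help', 'flight--a' ]
console.log(trimmedTokens);    // [ 'help', 'flight--a' ]
```

In other words, a whitespace-only separator keeps flight--a intact, but --help only survives if the trimmer is also removed. In lunr itself the separator can, as far as I know, be changed by assigning to lunr.tokenizer.separator before building the index.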