regex->trigram has wrong behaviour when trigrams are question-marked
GoogleCodeExporter opened this issue · comments
Google Code Exporter commented
bash$ csearch -verbose "foo_(bar_)?" >/dev/null
2012/03/05 10:44:37 query: "_ba" "foo" "o_b" "oo_"
2012/03/05 10:44:37 post query identified 11 possible files
bash$ csearch -verbose "foo_b(ar_)?" >/dev/null
2012/03/05 10:44:44 query: "foo" "o_b" "oo_"
2012/03/05 10:44:44 post query identified 12 possible files
In the first example, "_ba" and "o_b"' should not be required trigrams for the
file. In the second example, "ar_" is correctly _not_ included in the list of
required trigrams.
I'll take a look at this later tonight. It looks like an issue with precedence
in index/regexp.go. (I.e., the question mark is only applying to the final
trigram, and not all trigrams included in the grouping. ).
Original issue reported on code.google.com by dgryski
on 5 Mar 2012 at 10:07
Google Code Exporter commented
A bit of investigation has narrowed my focus to concat(x,y)'s handling when x
or y has canEmpty=true and a prefix/suffix list rather than a list of exact
matches. The hunt continues...
Original comment by dgryski
on 5 Mar 2012 at 6:29
- Added labels: ****
- Removed labels: ****
Google Code Exporter commented
I'm stumped. I think this bug is beyond my current familiarity with the
codebase. I'm certain the issue is with concat(), but after many hours of
staring at this I really have no idea what the fix should look like.
Original comment by dgryski
on 6 Mar 2012 at 8:40
- Added labels: ****
- Removed labels: ****
Google Code Exporter commented
So, it looks like this particular issue is actually in alternate(). The code
assumes that either x and y are both exact or both have prefix/suffix lists.
This obviously isn't the case.
The above bug is caused by the fact that emptyString() has an exact match but
no prefix/suffix lists, so when being combined we were unioning with a nil
list. This didn't crash, it just didn't have the expected behaviour.
This also fixes the case where, when searching for '(123|4567)q' , we would get
"67q" as the only required trigram, instead of the correct ("23q"|"67q").
Original comment by dgryski
on 7 Mar 2012 at 9:37
- Added labels: ****
- Removed labels: ****
Attachments:
Google Code Exporter commented
After writing that last sentence, I realized that my "corrected" trigrams were
still missing some. My implementation of alternate ignored the exact matches.
We now call addExact() before combining the matches. This allows the (this
time for sure ;) correct trigram set:
("123")|("456" "567") ("23q"|"67q")
to be generated.
Original comment by dgryski
on 7 Mar 2012 at 11:07
- Added labels: ****
- Removed labels: ****
Attachments:
Google Code Exporter commented
This issue was closed by revision 56b76ffbf8bb.
Original comment by dgryski
on 2 May 2012 at 8:26
- Changed state: Fixed
- Added labels: ****
- Removed labels: ****