google / codesearch

Fast, indexed regexp search over large file trees

Home Page:http://swtch.com/~rsc/regexp/regexp4.html

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

regex->trigram has wrong behaviour when trigrams are question-marked

GoogleCodeExporter opened this issue · comments

bash$ csearch -verbose "foo_(bar_)?" >/dev/null
2012/03/05 10:44:37 query: "_ba" "foo" "o_b" "oo_"
2012/03/05 10:44:37 post query identified 11 possible files
bash$ csearch -verbose "foo_b(ar_)?" >/dev/null
2012/03/05 10:44:44 query: "foo" "o_b" "oo_"
2012/03/05 10:44:44 post query identified 12 possible files

In the first example, "_ba" and "o_b"' should not be required trigrams for the 
file.  In the second example, "ar_" is correctly _not_ included in the list of 
required trigrams.

I'll take a look at this later tonight.  It looks like an issue with precedence 
in index/regexp.go.  (I.e., the question mark is only applying to the final 
trigram, and not all trigrams included in the grouping. ).

Original issue reported on code.google.com by dgryski on 5 Mar 2012 at 10:07

A bit of investigation has narrowed my focus to concat(x,y)'s handling when x 
or y has canEmpty=true and a prefix/suffix list rather than a list of exact 
matches.  The hunt continues...

Original comment by dgryski on 5 Mar 2012 at 6:29

  • Added labels: ****
  • Removed labels: ****
I'm stumped.  I think this bug is beyond my current familiarity with the 
codebase.  I'm certain the issue is with concat(), but after many hours of 
staring at this I really have no idea what the fix should look like.

Original comment by dgryski on 6 Mar 2012 at 8:40

  • Added labels: ****
  • Removed labels: ****
So, it looks like this particular issue is actually in alternate().  The code 
assumes that either x and y are both exact or both have prefix/suffix lists.  
This obviously isn't the case.

The above bug is caused by the fact that emptyString() has an exact match but 
no prefix/suffix lists, so when being combined we were unioning with a nil 
list.  This didn't crash, it just didn't have the expected behaviour.

This also fixes the case where, when searching for '(123|4567)q' , we would get 
"67q" as the only required trigram, instead of the correct ("23q"|"67q").

Original comment by dgryski on 7 Mar 2012 at 9:37

  • Added labels: ****
  • Removed labels: ****

Attachments:

After writing that last sentence, I realized that my "corrected" trigrams were 
still missing some.  My implementation of alternate ignored the exact matches.

We now call addExact() before combining the matches.  This allows the (this 
time for sure ;) correct trigram set: 
    ("123")|("456" "567")  ("23q"|"67q")
to be generated.


Original comment by dgryski on 7 Mar 2012 at 11:07

  • Added labels: ****
  • Removed labels: ****

Attachments:

This issue was closed by revision 56b76ffbf8bb.

Original comment by dgryski on 2 May 2012 at 8:26

  • Changed state: Fixed
  • Added labels: ****
  • Removed labels: ****