Inefficiencies in regex optimization

goodmami opened this issue 3 years ago · comments

Michael Wayne Goodman commented 3 years ago

The regex optimization sometimes produces overly complex regexes. Consider:

>>> _ = pe.compile(r'"a"*', flags=pe.DEBUG|pe.OPTIMIZE)
## Grammar ##
Start <- "a"*
## Modified Grammar ##
Start <- `(?=(?P<_1>(?:a)*))(?P=_1)`

Here, the single character 'a' does not need a non-capturing group, as the following are equivalent:

(?=(?P<_1>(?:a)*))(?P=_1)
(?=(?P<_1>a*))(?P=_1)
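The equivalence can be spot-checked with Python's standard re module. This is only a minimal sketch: the helper match_end and the sample strings are made up for illustration and are not part of pe, and agreement on a handful of inputs is evidence rather than proof.

```python
import re

# The two patterns from above: with and without the non-capturing group.
WITH_GROUP = re.compile(r"(?=(?P<_1>(?:a)*))(?P=_1)")
WITHOUT_GROUP = re.compile(r"(?=(?P<_1>a*))(?P=_1)")


def match_end(pattern, text):
    """Return the end offset of a match anchored at position 0, or None."""
    m = pattern.match(text)
    return None if m is None else m.end()


# Hypothetical sample inputs chosen for illustration.
SAMPLES = ["", "a", "aaa", "aab", "b", "baa"]

for text in SAMPLES:
    # Both columns should agree for every sample.
    print(repr(text), match_end(WITH_GROUP, text), match_end(WITHOUT_GROUP, text))
```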

I doubt this has much effect on performance, however. Now there's this:

>>> _ = pe.compile(r'"a"* / "b"', flags=pe.DEBUG|pe.OPTIMIZE)  # collapses to single regex
## Grammar ##
Start <- "a"* / "b"
## Modified Grammar ##
Start <- `(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)|b))(?P=_2)`
>>> _ = pe.compile(r'"a"* / ~"b"', flags=pe.DEBUG|pe.OPTIMIZE)  # alternative cannot be collapsed
## Grammar ##
Start <- "a"* / ~"b"
## Modified Grammar ##
Start <- `(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)))(?P=_2)` / ~`b`

There's no issue with the first one, aside from the superfluous non-capturing group. In the second, the semantic effect of the capture blocks the alternative from collapsing into a single regex, yet the outer lookahead/backreference (the _2 group) is still wrapped around the first alternative. That is, these are equivalent:

(?=(?P<_2>(?=(?P<_1>(?:a)*))(?P=_1)))(?P=_2)
(?=(?P<_1>(?:a)*))(?P=_1)
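One way to see why the outer wrapper is redundant is to rebuild both forms from a small helper and compare them with Python's re module. This is a rough sketch: atomic, match_end, FLAT, and NESTED are hypothetical names, not pe internals, and reading the lookahead/backreference pair as an emulated atomic group is an assumption about the optimizer's intent.

```python
import re


def atomic(inner: str, name: str) -> str:
    """Wrap a pattern in the lookahead + backreference idiom (hypothetical helper)."""
    return f"(?=(?P<{name}>{inner}))(?P={name})"


FLAT = atomic("(?:a)*", "_1")   # the simpler pattern
NESTED = atomic(FLAT, "_2")     # the pattern the optimizer currently emits


def match_end(pattern: str, text: str):
    """Return the end offset of a match anchored at position 0, or None."""
    m = re.match(pattern, text)
    return None if m is None else m.end()


# Hypothetical sample inputs; wrapping a second time should not change the result.
SAMPLES = ["", "a", "aaab", "b"]

for text in SAMPLES:
    print(repr(text), match_end(NESTED, text), match_end(FLAT, text))
```

If that reading is right, the wrapper is idempotent: the inner pattern already commits to a single match, so wrapping it again adds nothing.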

The example is from https://lists.csail.mit.edu/pipermail/peg/2021-October/000793.html. The parsing behavior is correct here, but the regex could be cleaner, which might help with debugging.