PhilippeSigaud / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Ambiguous grammar handling

eraxillan opened this issue · comments

Hello and thanks for awesome library!

I'm trying to implement Qt's qmake project files parser from scratch.
But i've found two issues:

  • Ambiguous colon sign: it can be boolean AND operator and single-line code block delimiter
  • Regular expressions directly in function parameters, e.g. func(var, ([A-Z])+), which can contain unbalanced parenthesizes and made function call statement ambiguous too

So the qmake language cannot be described using PEG.
However, i've managed to reinvent a wheel add "preprocessor" code eliminating those issues using naive parser.
It's just replaces "delimiter colon" with "@" char, and enquote regular expressions.
The resulting code already can be parsed using PEG/pegged.

Does pegged have some built-in stuff to help fighting with such kind of ambiguous grammars?
Or just i chose the wrong tool.
Thanks!

P.S. Link to my project using pegged

use | instead of / to match longest match instead of first match (for ambiguous rules) or improve your grammar. Would need a more concrete example what you parse and what you expect (and how it can be different depending on other code) If it is dependent on semantic meaning you might be able to make it work with semantic actions.

for your regex you might want to try some rules like

Regex < RegexPart+
RegexPart < RegexPart2 RegexCountModifier?
RegexPart2 < RegexCaptureGroup / RegexBracketGroup / RegexChar / RegexMatchStart / RegexMatchEnd
RegexCountModifier < '?' / '*' / '+' / '{' UInt ',' UInt '}' / '{' UInt ',' '}' / '{' UInt '}'
RegexChar <~ backslash (
            / 'x' Hex Hex
            / 'u' Hex Hex Hex Hex
            / 'U' Hex Hex Hex Hex Hex Hex Hex Hex
            / .
           )
         / !RegexSpecialChar .
RegexMatchStart < '^'
RegexMatchEnd < '$'
RegexSpecialChar < '?' / '^' / '$' / '*' / '+' / '{' / '}' / '[' / ']' / '(' / ')'
RegexBracketGroup < '[' RegexInvertGroup? RegexIncludeClosingBracket? RegexBracketPart* ']'
RegexInvertGroup < '^'
RegexIncludeClosingBracket < ']'
RegexBracketPart < RegexBracketRange / RegexBracketChar
RegexBracketRange < RegexBracketChar '-' RegexBracketChar
RegexBracketChar < !']' .
RegexCaptureGroup < '(' RegexGroupModifier? RegexPart+ ')'
RegexGroupModifier < RegexPositiveLookahead / RegexNegativeLookahead / RegexNonCaptureGroup
RegexPositiveLookahead < '?='
RegexNegativeLookahead < '?!'
RegexNonCaptureGroup < '?:'
UInt <~ [0-9]+

I guess. (not well tested, best to make your own grammar to fully understand) I don't see how can it contain unbalanced parentheses?

Based my grammar mostly on a very lightweight PCRE syntax now, not sure what you have.

@WebFreak001 thanks for your answer! and sorry for such delay.

First of all, my grammar description.

  1. Unbalanced parenthesis example:
m = $$replace(out, ".*\\$\\(EXPORT_([^)]+)\\).*", \\1)
#                                                         ^
#                                                       oh no! unbalanced
# this code replace the value of 'out' variable using given regex and set it to 'm'

So, you think i should use regex grammar for such kind of parameter?

  1. Scope statement ambiguity example:
win32 : msvc                    :               CONFIG += win_stuff
#       ^                              ^
#  logical AND      end of logical statement
# this code add value 'win_stuff' to CONFIG variable only if bool condition 'win32 AND msvc' is true

Scope expression described on line 181:

 Scope             <- BooleanExpression ScopeMainBranch ScopeElseIfBranch* ScopeElseBranch?

Well, i'll think where i can use | operator here.

for 1 you should implement a proper string parser (as you have the value quoted, which would mean it couldn't be unbalanced parentheses if done properly), look at https://github.com/PhilippeSigaud/Pegged/blob/master/pegged/examples/strings.d

if your values are unquoted, look at my regex grammar I posted which will make it balanced too

for 2: if it is always in this order (everything before last colon is logical and) then it is easy to describe as rule using lookaheads like

Config     < Statement+
Statement  < Checks ":" Expression
                            # v  the & does the magic
Checks     < Check (":" Check &":")*
Check      < identifier
Expression < Variable Operator Value

if it's not always in this order, and dependent on the name of the identifier you write in there (being any arbitrary variable you defined earlier) you will either parse it as one list and resolve it later as part of your libraries API or try to implement it using semantic actions.