Ambiguous grammar handling
eraxillan opened this issue
Hello, and thanks for the awesome library!
I'm trying to implement a parser for Qt's qmake project files from scratch.
But I've found two issues:
- Ambiguous colon sign: it can be both the boolean AND operator and the single-line code block delimiter
- Regular expressions written directly in function parameters, e.g. func(var, ([A-Z])+), which can contain unbalanced parentheses and make the function call statement ambiguous too
So the qmake language cannot be described using PEG.
However, I've managed to reinvent the wheel and add "preprocessor" code that eliminates those issues using a naive parser.
It just replaces the "delimiter colon" with an "@" character and puts regular expressions in quotes.
The resulting code can then be parsed using PEG/Pegged.
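For readers who want to picture the colon rewrite, here is a minimal Python sketch. It is not the code from the project: the function name `mark_delimiter_colon` is hypothetical, and it assumes a single-line statement where the delimiter colon is the last colon before the first assignment operator. Quoting of regular expressions is not covered.

```python
import re

# Assignment operators assumed to end the condition part of a statement.
_ASSIGN = re.compile(r"\+=|-=|\*=|~=|=")


def mark_delimiter_colon(line: str, marker: str = "@") -> str:
    """Replace the colon that separates the condition from the statement.

    Assumption: the delimiter is the last ':' before the first assignment
    operator; every earlier ':' is a boolean AND and is left untouched.
    """
    op = _ASSIGN.search(line)
    if op is None:
        return line  # no assignment on this line, nothing to rewrite
    colon = line.rfind(":", 0, op.start())
    if colon == -1:
        return line  # unconditional assignment
    return line[:colon] + marker + line[colon + 1:]


rewritten = mark_delimiter_colon("win32 : msvc : CONFIG += win_stuff")
```

Only the final colon changes, so a grammar can treat ":" as AND and "@" as the block delimiter without any ambiguity.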
Does Pegged have any built-in features to help deal with this kind of ambiguous grammar?
Or did I just choose the wrong tool?
Thanks!
P.S. Link to my project using pegged
Use | instead of / to pick the longest match instead of the first match (for ambiguous rules), or improve your grammar. I would need a more concrete example of what you parse and what you expect (and how it can differ depending on other code). If it depends on semantic meaning, you might be able to make it work with semantic actions.
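To make the difference between the two choice operators concrete, here is a small Python sketch with hand-rolled parser functions. It only models the idea (first alternative that succeeds versus the alternative that consumes the most input); the helper names `literal`, `first_match` and `longest_match` are made up for this example and are not part of Pegged.

```python
def literal(text):
    """Parser that matches `text` at `pos` and returns the new position."""
    def parse(s, pos):
        return pos + len(text) if s.startswith(text, pos) else None
    return parse


def first_match(*alternatives):
    """Ordered choice, like '/': the first alternative that succeeds wins."""
    def parse(s, pos):
        for alt in alternatives:
            end = alt(s, pos)
            if end is not None:
                return end
        return None
    return parse


def longest_match(*alternatives):
    """Longest choice, like '|': try all alternatives, keep the longest."""
    def parse(s, pos):
        ends = [e for e in (alt(s, pos) for alt in alternatives) if e is not None]
        return max(ends) if ends else None
    return parse


operators = [literal("+"), literal("+=")]
first_end = first_match(*operators)("+= x", 0)      # stops after "+"
longest_end = longest_match(*operators)("+= x", 0)  # consumes "+="
```

With ordered choice the same effect needs the longer alternative listed first, which is the "improve your grammar" option.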
For your regex, you might want to try some rules like:
Regex < RegexPart+
RegexPart < RegexPart2 RegexCountModifier?
RegexPart2 < RegexCaptureGroup / RegexBracketGroup / RegexChar / RegexMatchStart / RegexMatchEnd
RegexCountModifier < '?' / '*' / '+' / '{' UInt ',' UInt '}' / '{' UInt ',' '}' / '{' UInt '}'
RegexChar <~ backslash (
/ 'x' Hex Hex
/ 'u' Hex Hex Hex Hex
/ 'U' Hex Hex Hex Hex Hex Hex Hex Hex
/ .
)
/ !RegexSpecialChar .
RegexMatchStart < '^'
RegexMatchEnd < '$'
RegexSpecialChar < '?' / '^' / '$' / '*' / '+' / '{' / '}' / '[' / ']' / '(' / ')'
RegexBracketGroup < '[' RegexInvertGroup? RegexIncludeClosingBracket? RegexBracketPart* ']'
RegexInvertGroup < '^'
RegexIncludeClosingBracket < ']'
RegexBracketPart < RegexBracketRange / RegexBracketChar
RegexBracketRange < RegexBracketChar '-' RegexBracketChar
RegexBracketChar < !']' .
RegexCaptureGroup < '(' RegexGroupModifier? RegexPart+ ')'
RegexGroupModifier < RegexPositiveLookahead / RegexNegativeLookahead / RegexNonCaptureGroup
RegexPositiveLookahead < '?='
RegexNegativeLookahead < '?!'
RegexNonCaptureGroup < '?:'
UInt <~ [0-9]+
I guess (not well tested; it's best to write your own grammar to fully understand it). I don't see how it can contain unbalanced parentheses, though.
I've based my grammar mostly on a very lightweight PCRE syntax; not sure what you have.
@WebFreak001 thanks for your answer, and sorry for the delay.
First of all, my grammar description.
- Unbalanced parenthesis example:
m = $$replace(out, ".*\\$\\(EXPORT_([^)]+)\\).*", \\1)
#                                     ^
#                                     oh no! unbalanced
# this code replaces the value of the 'out' variable using the given regex and assigns the result to 'm'
So, you think I should use a regex grammar for this kind of parameter?
- Scope statement ambiguity example:
win32 : msvc : CONFIG += win_stuff
#     ^      ^
#     |      end of logical statement
#     logical AND
# this code adds the value 'win_stuff' to the CONFIG variable only if the boolean condition 'win32 AND msvc' is true
Scope expression described on line 181:
Scope <- BooleanExpression ScopeMainBranch ScopeElseIfBranch* ScopeElseBranch?
Well, I'll think about where I can use the | operator here.
For 1, you should implement a proper string parser (since you have the value quoted, it can't have unbalanced parentheses if done properly); look at https://github.com/PhilippeSigaud/Pegged/blob/master/pegged/examples/strings.d
If your values are unquoted, look at the regex grammar I posted, which will keep it balanced too.
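As a rough illustration of why a proper string rule removes the problem, here is a Python sketch that finds the end of a call while treating a double-quoted string as one opaque token. It is not taken from the Pegged strings example; the function name `find_call_end` is hypothetical, and it assumes double-quoted strings where a backslash escapes the next character.

```python
def find_call_end(src: str, open_pos: int) -> int:
    """Return the index of the ')' closing the '(' at `open_pos`.

    Parentheses inside double-quoted strings are ignored, so a quoted
    regex cannot unbalance the call.
    """
    if src[open_pos] != "(":
        raise ValueError("open_pos must point at '('")
    depth = 0
    in_string = False
    i = open_pos
    while i < len(src):
        ch = src[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("no matching ')' found")


# The example from this thread, kept verbatim in a raw string.
line = r'm = $$replace(out, ".*\\$\\(EXPORT_([^)]+)\\).*", \\1)'
end = find_call_end(line, line.index("("))
```

A plain count of "(" and ")" on that line does not match, but once the quoted argument is skipped as a unit the call closes at the last character of the line.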
For 2: if it is always in this order (everything before the last colon is a logical AND), then it is easy to describe as a rule using lookaheads, like:
Config < Statement+
Statement < Checks ":" Expression
#                         v the & does the magic
Checks < Check (":" Check &":")*
Check < identifier
Expression < Variable Operator Value
If it's not always in this order, and it depends on the name of the identifier you write there (which could be any arbitrary variable you defined earlier), you will either have to parse it as one list and resolve it later as part of your library's API, or try to implement it using semantic actions.
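The lookahead trick in the Checks rule can also be sketched by hand in Python, which may help when debugging the grammar. This is only an approximation under stated assumptions: the function name `parse_statement` is hypothetical, checks are plain identifiers (no negation or function calls), and the expression is returned as raw text instead of being parsed further.

```python
import re

# Rough stand-in for an identifier, with surrounding whitespace skipped.
IDENT = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*)\s*")


def parse_statement(line: str):
    """Split `check : check : expression` into (checks, expression)."""
    m = IDENT.match(line, 0)
    if m is None:
        raise ValueError("expected a check")
    checks = [m.group(1)]
    pos = m.end()
    # Mirrors (":" Check &":")*: take another check only if a colon follows.
    while pos < len(line) and line[pos] == ":":
        m = IDENT.match(line, pos + 1)
        followed_by_colon = (
            m is not None and m.end() < len(line) and line[m.end()] == ":"
        )
        if not followed_by_colon:
            break  # the lookahead failed, so this colon ends the checks
        checks.append(m.group(1))
        pos = m.end()  # the colon seen by the lookahead is not consumed
    if pos >= len(line) or line[pos] != ":":
        raise ValueError("expected ':' before the expression")
    return checks, line[pos + 1:].strip()


checks, expression = parse_statement("win32 : msvc : CONFIG += win_stuff")
```

Here `checks` holds the two conditions and `expression` holds the assignment; the last colon is recognised purely because the identifier after it is not followed by another colon.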