Quoted sequence in range bound parsed incorrectly (PCRE.g4)

Question

david-wahlstedt opened this issue 2 months ago

I tried the PCRE.g4 grammar on an example where the bounds of a character class range are quoted sequences. The parser took \E and \Q as the bounds, when they should have been the minimal element of the first quoted sequence (a) and the maximal element of the second (d). I also tried making only one of the bounds a quoted sequence, with a similar result.

echo -n '[\Qba\E-\Qdc\E]'|antlr4-parse PCRE.g4 pcre -tree
(pcre:1 (alternation:1 (expr:1 (element:1 (atom:13 (character_class:2 [ (character_class_atom:5 \ Q) (character_class_atom:6 b) (character_class_atom:6 a) (character_class_atom:1 (character_class_range:1 (character_class_range_atom:2 \ E) - (character_class_range_atom:2 \ Q))) (character_class_atom:6 d) (character_class_atom:6 c) (character_class_atom:5 \ E) ]))))) <EOF>)

This is how it behaves with pcre2test (I also tried it for PCRE version 1 on https://regex101.com/, with the correct result):

echo "/[\Qba\E-\Qdc\E]/info@c" | tr -s "@" "\n"|pcre2test -b
PCRE2 version 10.39 2021-10-29
/[\Qba\E-\Qdc\E]/info
------------------------------------------------------------------
  0  36 Bra
  3     [a-d]
 36  36 Ket
 39     End
------------------------------------------------------------------
Capture group count = 0
Starting code units: a b c d 
Subject length lower bound = 1
c
 0: c

I understand that the example is weird, but it could perhaps be handled differently, for example by allowing the \Qba\E block as a range endpoint and letting the semantics take care of what it means.
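
A rough illustration of what "letting the semantics take care of it" could look like, written in Python rather than as grammar code. It assumes that \Q...\E only marks its characters as literals, so that the class body above reads as ba-dc once the markers are dropped. That reading is consistent with the pcre2test output ([a-d], starting code units a b c d), but it is an assumption and not taken from the PCRE2 sources. The name class_members is hypothetical, and other escapes, negation and POSIX classes are deliberately ignored.

```python
def class_members(body: str) -> set[str]:
    # Sketch only: `body` is the text between '[' and ']'.
    # Pass 1: drop the \Q and \E markers, remembering which characters
    # were inside a quoted sequence.
    items: list[tuple[str, bool]] = []  # (character, was_quoted)
    quoted = False
    i = 0
    while i < len(body):
        if quoted and body.startswith("\\E", i):
            quoted = False
            i += 2
        elif not quoted and body.startswith("\\Q", i):
            quoted = True
            i += 2
        else:
            items.append((body[i], quoted))
            i += 1

    # Pass 2: an unquoted '-' between two characters forms a range;
    # a quoted '-' is an ordinary member.
    members: set[str] = set()
    j = 0
    while j < len(items):
        low = items[j][0]
        if j + 2 < len(items) and items[j + 1] == ("-", False):
            high = items[j + 2][0]
            if ord(high) < ord(low):
                raise ValueError(f"range out of order: {low}-{high}")
            members.update(chr(c) for c in range(ord(low), ord(high) + 1))
            j += 3
        else:
            members.add(low)
            j += 1
    return members


print(sorted(class_members(r"\Qba\E-\Qdc\E")))
```

Under that assumption, the range endpoints are the last character of the first quoted sequence and the first character of the second, which for this example happen to be a and d.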

I also have a question I don't know where to put: are you going to support PCRE2 any time soon?
This is the only formal grammar I have found out there for PCRE. Great job!!!

Best regards, David

Bart Kiers · Answer 1 · Wed Jun 05 2024 00:45:32 GMT+0800 (China Standard Time)

Didn't really test this, but you could try:

character_class_atom
    : quoting                          // <-- added
    | character_class_range
    | posix_character_class
    | character
    | character_type
    // | '\\' .                        // <-- removed
    | ~( '\\' | ']')
    ;

...

quoting
    : '\\' 'Q' .*? '\\' 'E'            // <-- swapped with the alt below
    | '\\' .
    ;

david-wahlstedt · Answer 2 · Fri Jun 07 2024 01:55:41 GMT+0800 (China Standard Time)

Thanks! It works!
I am implementing a PCRE2 parser using BNFC, a Haskell-based tool. One provides an LBNF grammar and gets a data type, a parser, a pretty printer, and a case skeleton for the type. I don't know if BNFC is the best choice, but it's fun to try. It is quite challenging, and I have looked at your grammar to help understand the grammatical structure of PCRE, along with the man pages. There is not much formal information out there about the grammar of PCRE: your grammar is the only one I've found. Nice work, thanks!
I tried your examples, and I see that some of them, e.g. in various.txt, are not valid according to regex101.com. I can send more feedback on that later!
To provoke an error I tried \Qd\Ec\E on your patched parser, and it gave:

echo -n '\Qd\Ec\E'|antlr4-parse PCRE.g4 pcre -tree
(pcre:1 (alternation:1 (expr:1 (element:1 (atom:19 (quoting:1 \ Q d \ E))) (element:1 (atom:15 (letter:1 c))) (element:1 (atom:19 (quoting:2 \ E))))) <EOF>)

The exit status was 0 and I don't see any error message. I think the \E at the end should trigger an error; at least regex101.com reports one. But pcre2test does not: it seems a bit forgiving. I wonder if there is any reference tool that can really check whether a PCRE/PCRE2 expression is valid? (Sorry for my overly long answer.)

David

Bart Kiers · Answer 3 · Fri Jun 07 2024 02:10:54 GMT+0800 (China Standard Time)

It's been quite a while, but I think I wrote the grammar "forgivingly" as well, making it the responsibility of the semantic phase to report any incorrect tokens (like the dangling \E). Good to hear the grammar is of help to you. Best of luck!
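
A minimal sketch of such a semantic check, in Python and working on the raw pattern text rather than on the ANTLR parse tree. The function name dangling_quote_ends is hypothetical. It assumes that outside a quoted sequence a backslash always escapes the next character, and that inside one only \E has special meaning. Whether a dangling \E should count as an error is a policy choice, since pcre2test accepts it silently.

```python
def dangling_quote_ends(pattern: str) -> list[int]:
    # Return the offsets of every \E that does not close a \Q.
    offsets: list[int] = []
    quoted = False
    i = 0
    while i < len(pattern):
        if quoted:
            # Inside \Q...\E everything is literal until the next \E.
            if pattern.startswith("\\E", i):
                quoted = False
                i += 2
            else:
                i += 1
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            escaped = pattern[i + 1]
            if escaped == "Q":
                quoted = True
            elif escaped == "E":
                offsets.append(i)  # \E with no open \Q
            i += 2  # skip the backslash and the escaped character
        else:
            i += 1
    return offsets


print(dangling_quote_ends(r"\Qd\Ec\E"))  # [6]
```

For the pattern from the previous answer this reports offset 6, the position of the trailing \E, while the quoted bounds in [\Qba\E-\Qdc\E] produce no report.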

david-wahlstedt · Answer 4 · Fri Jun 07 2024 02:27:06 GMT+0800 (China Standard Time)

Yes, it makes sense to leave some parts to the semantic side: especially when so much depends on which options are set, version, etc. Thanks!