Cannot compile when a user C string contains a special pattern.

Question

krishna116 opened this issue 3 years ago

For example, this does not compile:

int lex(const char *YYCURSOR) 
{
    for(;;)
    {
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;
        
        [123]   { break; }
    */
    break;
    }
    return 0;
}

void strange_string1()
{
    const char str[] = "%{}"; //error here.
}

void strange_string2()
{
    const char str[] = "%{{}}";  //error here.
}

re2c version: 2.2
compile: re2c.exe test.lex -o test.c

Thanks.

Ulya Trofimovich · Answer 1 · Wed Dec 22 2021 15:35:36 GMT+0800 (China Standard Time)

This is because re2c thinks that %{ is the beginning of a re2c block (historically it allows one to use YACC-style markers %{ and %} for the beginning and the end of a block).

The easiest fix for you would be to break the C string at this point: use "%" "{}" instead of "%{}" (and likewise with the second string). The C compiler concatenates adjacent string literals into one, so no other changes are needed.
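As a minimal sketch of that workaround, the two strings from the report can be written as below. The accessor functions reuse the names from the report, but their return types are changed here (an assumption made only so the contents can be inspected).

```c
/* Each initializer is written as two adjacent literals, so the percent sign
   and the opening brace never sit next to each other in the source file.
   Adjacent literals are joined at compile time, so the array contents are
   the same as for the single-literal form. */
static const char strange1[] = "%" "{}";
static const char strange2[] = "%" "{{}}";

/* Accessors named after the functions in the report; returning the string
   is an addition made for illustration. */
const char *strange_string1(void) { return strange1; }
const char *strange_string2(void) { return strange2; }
```

Comparing the results with strcmp against the unsplit forms shows that the contents are unchanged, while the text re2c sees no longer contains the marker.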

Ulya Trofimovich · Answer 2 · Wed Dec 22 2021 15:56:19 GMT+0800 (China Standard Time)

I will see if I can fix re2c to not recognize simple cases like this one as the beginning of a block.

Krishna · Answer 3 · Wed Dec 22 2021 18:22:48 GMT+0800 (China Standard Time)

Yes, I know flex also uses the %{ marker for options. I commented the strings out, but the error still happens, which is strange...

    // "{} %{}"; //error
    // "%{--}"; //error

Ulya Trofimovich · Answer 4 · Wed Dec 22 2021 19:16:15 GMT+0800 (China Standard Time)

Comment doesn't help because it's outside of the re2c block, and re2c can only recognize comments inside of blocks. re2c does not parse code in other languages, otherwise it would have to include all the complexity of a compiler frontend for each of the languages it supports (C/C++, Go or Rust). That would add a lot of complexity for very little value. The idea of re2c is that it doesn't care about the contents of the file outside of re2c blocks: it can be anything, and re2c will copy-paste it verbatim into the output file. Therefore re2c doesn't recognize // comments in the user-defined code outside of re2c blocks. It merely looks for the start of the block, which can be /*!re2c or %{.
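The behaviour described above can be illustrated with a deliberately simplified scanner. This is not re2c's lexer (which is generated from lex.re); find_block_start is a hypothetical helper that only shows why a scan that knows nothing about the host language still finds the marker inside a comment or a string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not re2c's code: return the offset of the first
   block-start marker in src, or -1 if there is none. The scan knows nothing
   about C strings or comments, so a marker inside either one still counts. */
long find_block_start(const char *src)
{
    /* Both markers are written as split literals so that this file itself
       would be safe to pass through re2c. */
    static const char c_marker[] = "/*" "!re2c";
    static const char y_marker[] = "%" "{";

    for (size_t i = 0; src[i] != '\0'; ++i) {
        if (strncmp(src + i, c_marker, sizeof c_marker - 1) == 0 ||
            strncmp(src + i, y_marker, sizeof y_marker - 1) == 0) {
            return (long)i;
        }
    }
    return -1;
}
```

For the commented-out line from the previous reply, the helper reports a match at the percent sign even though a C compiler would treat the whole line as a comment.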

You can break the string in two as I suggested: "%" "{}".

I will try to fix this a bit later (maybe require that %{ is preceded by a newline).

Ulya Trofimovich · Answer 5 · Wed Dec 22 2021 19:17:42 GMT+0800 (China Standard Time)

If you are interested, here is how re2c lexer handles %{: https://github.com/skvadrik/re2c/blob/master/src/parse/lex.re#L139

Krishna · Answer 6 · Thu Dec 23 2021 08:45:06 GMT+0800 (China Standard Time)

I can break my string into "%" "{}" because I am only running some tests. I looked at the code yesterday, and I think it is hard to change: the lexer would have to ignore comments, but comments sometimes matter to re2c; or it would have to remember whether it is inside or outside a re2c block, but %{ is itself a block marker...

Maybe the marker could be deprecated and later removed, and/or there could be an option such as --using-deprecated-feature...

Thanks.

Ulya Trofimovich · Answer 7 · Thu Dec 23 2021 16:09:18 GMT+0800 (China Standard Time)

I cannot deprecate this feature because some real-world re2c code may rely on it. Imagine an old project that has been using re2c for many years, and suddenly it breaks because re2c decided to deprecate %{. It is not a good situation, even if the fix is as simple as adding an option. But if I can make %{ match only at the beginning of a line, that would solve most of the problems.
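A minimal sketch of the "only at the beginning of a line" idea follows. It is an illustration under stated assumptions, not the change that was made in re2c: find_block_start_anchored is a hypothetical helper, and treating only offset 0 or the position right after a newline as a line start (no leading whitespace allowed) is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: like a plain marker scan, except that the YACC-style
   marker counts only at the very start of a line. */
long find_block_start_anchored(const char *src)
{
    /* Split literals keep this file safe to pass through re2c. */
    static const char c_marker[] = "/*" "!re2c";
    static const char y_marker[] = "%" "{";

    for (size_t i = 0; src[i] != '\0'; ++i) {
        int at_line_start = (i == 0 || src[i - 1] == '\n');

        if (strncmp(src + i, c_marker, sizeof c_marker - 1) == 0) {
            return (long)i;
        }
        if (at_line_start &&
            strncmp(src + i, y_marker, sizeof y_marker - 1) == 0) {
            return (long)i;
        }
    }
    return -1;
}
```

With this rule a string literal in the middle of a line is no longer mistaken for a block start, while a marker on its own line still is. Host-language text that happens to begin a line with the marker would still be matched.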

Ulya Trofimovich · Answer 8 · Sat Dec 25 2021 04:29:08 GMT+0800 (China Standard Time)

Commit dba7d05 fixes the problem described in this bug. It does not fix all possible cases (see the added test https://github.com/skvadrik/re2c/blob/dba7d055aa02f55482d69a279016460dedd9a380/test/layout/flex_braces.re) but the remaining ones are more rare.

Krishna · Answer 9 · Sat Dec 25 2021 20:36:47 GMT+0800 (China Standard Time)

I think you are right: it is much better if %{ is recognized only at the beginning of a line. I hit this bug when using re2c to parse something like printf arguments... It is much better now, thank you for your effort.

By the way, I would like to ask you a question. If I want to read and understand your paper https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf, what are the best steps to take, and what basic knowledge should I have before I read it?
(I have only a little knowledge of NFA/DFA/DFA minimization from the Dragon Book.)

Ulya Trofimovich · Answer 10 · Sun Dec 26 2021 18:20:36 GMT+0800 (China Standard Time)

"what are the best steps to do and what basic knowledge I should know before I read it?"

I suggest the following approach.

Step 1 - understanding the problem

First you should understand the basic concepts (I provided some Wikipedia links, but all these things are well known and easy to google, so don't hesitate to follow the links on Wikipedia to other articles, papers and books):

Thompson's NFA construction
NFA simulation
Determinization by powerset construction

I suggest that you manually go through a series of examples with pencil and paper: start with some regular expression, convert it to a Thompson's NFA, simulate the NFA on some strings, then convert it to a DFA, then execute DFA on some strings.
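To check pencil-and-paper results, NFA simulation can be sketched in a few lines of C. The example below is an assumption-laden illustration: the regular expression (a|b)*abb and the hand-written four-state NFA are chosen here, and the NFA has no epsilon-transitions, so it is not a Thompson NFA (that would need an extra epsilon-closure step).

```c
#include <stdbool.h>

/* NFA for (a|b)*abb with states 0..3. A set of states is a bitmask:
   bit i is set when state i is active. */
enum { START = 1u << 0, ACCEPT = 1u << 3 };

/* Advance every active state on input character c. */
static unsigned step(unsigned states, char c)
{
    unsigned next = 0;
    if (states & (1u << 0)) {
        if (c == 'a') next |= (1u << 0) | (1u << 1);  /* loop, or start "abb" */
        if (c == 'b') next |= (1u << 0);              /* loop */
    }
    if ((states & (1u << 1)) && c == 'b') next |= 1u << 2;
    if ((states & (1u << 2)) && c == 'b') next |= 1u << 3;
    return next;
}

/* Simulate the NFA on s by tracking the set of active states. */
bool nfa_matches(const char *s)
{
    unsigned states = START;
    for (; *s != '\0'; ++s) {
        states = step(states, *s);
    }
    return (states & ACCEPT) != 0;
}
```

The distinct values that the states variable takes during simulation are the subsets that powerset construction would turn into DFA states.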

Second, ask yourself a question: what if I want to mark some point in a regular expression (add a tag) and find out where it is in the matched string? Revisit your examples, but now add some tags in the regular expression, translate them to tagged epsilon-transitions in the NFA, and attempt to do tagged NFA simulation and determinization. Think about the changes that are needed in these algorithms in order to track tag values. Start with easy non-ambiguous examples like a* @t b* where @t is a tag.
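For the non-ambiguous example a* @t b*, the expected tag value can be computed with a hand-written matcher such as the sketch below. It is not a TDFA construction, only a reference for checking results; the function name match_with_tag is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Match the whole of s against a* @t b*. On success, store in *t the offset
   at which the tag sits (the end of the run of 'a's) and return true.
   On failure, *t is left unchanged. */
bool match_with_tag(const char *s, size_t *t)
{
    size_t i = 0;
    while (s[i] == 'a') ++i;   /* a* */
    size_t tag = i;            /* @t: the position between a* and b* */
    while (s[i] == 'b') ++i;   /* b* */
    if (s[i] != '\0') {
        return false;          /* trailing input that the pattern cannot match */
    }
    *t = tag;
    return true;
}
```

On "aabbb" the tag is 2, on "bbb" it is 0, and "aba" does not match. A tagged automaton built from the same expression should report the same offsets.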

You may find these slides with examples useful.

Think about ambiguity in regular expressions, when it is possible to match in more than one way. Consider the following ambiguous cases with POSIX-style capturing parentheses:

ambiguous concatenation, e.g. (a|ab)(bc|c) matched against "abc" can result in "ab", "c" or in "a", "bc"
ambiguous alternative, e.g. (a*)|(a) matched against "a" can result in "a", "" or in "", "a"
ambiguous iteration, e.g. (a*)* matched against "aa" can result in "a", "a" or in "aa"
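The first case can be checked by brute force. The sketch below hard-codes the expression (a|ab)(bc|c) and tries every split point of the input; the helper names are hypothetical, and it only enumerates the possible parses without choosing between them, which is what a disambiguation policy such as POSIX has to do.

```c
#include <stdio.h>
#include <string.h>

/* True if the n bytes at p are exactly the string lit. */
static int is_lit(const char *p, size_t n, const char *lit)
{
    return strlen(lit) == n && memcmp(p, lit, n) == 0;
}

/* Print every way to split s into x y with x in (a|ab) and y in (bc|c),
   and return the number of such splits. */
int count_parses(const char *s)
{
    size_t len = strlen(s);
    int count = 0;
    for (size_t k = 0; k <= len; ++k) {
        int first = is_lit(s, k, "a") || is_lit(s, k, "ab");
        int second = is_lit(s + k, len - k, "bc") ||
                     is_lit(s + k, len - k, "c");
        if (first && second) {
            printf("\"%.*s\", \"%s\"\n", (int)k, s, s + k);
            ++count;
        }
    }
    return count;
}
```

Calling count_parses("abc") finds the two parses listed above, while an input such as "ac" has only one.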

Step 2 - reading the papers

The original paper by Laurikari
Short introductory article about TDFA
I am writing a new detailed paper about TDFA together with Angelo Borsotti. It is not finished yet, but the main part is ready and it contains a detailed example at the end.

Krishna · Answer 11 · Sun Dec 26 2021 18:42:17 GMT+0800 (China Standard Time)

I think it may not be easy for me, but I will try, because I find TDFA interesting.

Thank you for your answer.