skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page: https://re2c.org

Cannot compile when a user C string has a special pattern.

krishna116 opened this issue · comments

For example this cannot compile:

int lex(const char *YYCURSOR) 
{
    for(;;)
    {
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;
        
        [123]   { break; }
    */
    break;
    }
    return 0;
}

void strange_string1()
{
    const char str[] = "%{}"; //error here.
}

void strange_string2()
{
    const char str[] = "%{{}}";  //error here.
}

re2c version: 2.2
compile: re2c.exe test.lex -o test.c

Thanks.

This is because re2c thinks that %{ is the beginning of a re2c block (historically it allows one to use YACC-style markers %{ and %} for the beginning and the end of a block).

The easiest fix for you would be to break the C string at this point: use "%" "{}" instead of "%{}" (and likewise with the second string). The C compiler concatenates adjacent string literals into one, so no other changes are needed.
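A minimal sketch of that workaround, for illustration only. The names split_str1, split_str2 and split_strings_match are hypothetical and not part of the original test file; the expected values are spelled character by character so that the source never contains a percent sign directly followed by an opening brace.

```c
#include <string.h>

/* Adjacent string literals are merged into one during compilation, so the
   source text never has a percent sign directly followed by an opening
   brace. (Names here are hypothetical, chosen for this sketch.) */
static const char split_str1[] = "%" "{}";
static const char split_str2[] = "%" "{{}}";

/* Returns 1 if both split literals equal the strings from the report. */
int split_strings_match(void)
{
    /* Expected values are built from single characters for the same reason. */
    const char expect1[] = { '%', '{', '}', '\0' };
    const char expect2[] = { '%', '{', '{', '}', '}', '\0' };
    return strcmp(split_str1, expect1) == 0
        && strcmp(split_str2, expect2) == 0;
}
```

The resulting arrays are byte-for-byte the same as the unsplit literals, so code that uses them does not need to change.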

I will see if I can fix re2c to not recognize simple cases like this one as the beginning of a block.

Yes, I know flex also has the %{ marker for options. I commented the strings out, but the error still happens, which is strange...

    // "{} %{}"; //error
    // "%{--}"; //error

Commenting the string out doesn't help, because the comment is outside of any re2c block and re2c recognizes comments only inside of blocks. re2c does not parse code in other languages; otherwise it would need a compiler frontend for each language it supports (C/C++, Go and Rust), which would add a lot of complexity for very little value. The idea of re2c is that it doesn't care about the contents of the file outside of re2c blocks: it can be anything, and re2c copies it verbatim into the output file. Therefore re2c doesn't recognize // comments in user-defined code outside of re2c blocks. It merely looks for the start of a block, which can be /*!re2c or %{.
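To illustrate why a comment makes no difference, here is a rough sketch of a scanner that only searches for the two block-start markers and knows nothing about the host language. It is not the actual re2c lexer (that lives in src/parse/lex.re); find_block_start is a hypothetical helper written for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first block-start marker
   in src, or -1 if there is none. It has no notion of C strings or
   comments, so a marker inside either one is still reported. */
long find_block_start(const char *src)
{
    const char *a = strstr(src, "/*!re2c");
    const char *b = strstr(src, "%{");
    const char *first;

    if (a == NULL) {
        first = b;
    } else if (b == NULL) {
        first = a;
    } else {
        first = (a < b) ? a : b;   /* whichever marker comes first */
    }
    return first != NULL ? (long)(first - src) : -1;
}
```

With this kind of scan, a marker after // is found just like one in ordinary code, which matches the behaviour described above.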

You can break the string in two as I suggested: "%" "{}".

I will try to fix this a bit later (maybe require that %{ is preceded by a newline).

If you are interested, here is how the re2c lexer handles %{: https://github.com/skvadrik/re2c/blob/master/src/parse/lex.re#L139

I can break my string into "%" "{}" because I am just doing some tests. I looked at the code yesterday, and I think it is hard to adjust: either the lexer ought to ignore comments, but comments are sometimes important for re2c, or it ought to remember whether it is inside or outside of a re2c block, but %{ is itself a block marker...

Maybe the feature could be deprecated and then removed, and/or there could be an option like --using-deprecated-feature...

Thanks.

I cannot deprecate this feature because some real-world re2c code may rely on it. Imagine an old project that has been using re2c for many years, and suddenly it breaks because re2c decided to deprecate %{. It is not a good situation, even if the fix is as simple as adding an option. But if I can make %{ match only at the beginning of a line, that would solve most of the problems.
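A rough sketch of the "only at the beginning of a line" idea, under the assumption that a marker counts only when it is the first thing in the input or directly follows a newline. This is not the code from re2c; find_line_start_marker is a hypothetical helper written for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first "%{" that sits at
   the very start of a line, or -1 if there is none. */
long find_line_start_marker(const char *src)
{
    const char *p = src;

    while ((p = strstr(p, "%{")) != NULL) {
        /* Accept the marker only at the start of input or after '\n'. */
        if (p == src || p[-1] == '\n') {
            return (long)(p - src);
        }
        p += 1;   /* skip this occurrence and keep searching */
    }
    return -1;
}
```

In this sketch the strings from the report no longer match, because the marker appears in the middle of a line. A marker that happens to start a line inside a multi-line comment or string would still be picked up.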

Commit dba7d05 fixes the problem described in this bug. It does not fix all possible cases (see the added test https://github.com/skvadrik/re2c/blob/dba7d055aa02f55482d69a279016460dedd9a380/test/layout/flex_braces.re) but the remaining ones are more rare.

I think you are right: if %{ has to appear at the beginning of a line, it will be much better. I hit this bug when I used re2c to parse something like printf arguments... Now it is much better, thank you for your effort.

By the way, I would like to ask you a question. If I want to read and understand your paper https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf, what are the best steps to take, and what basic knowledge should I have before reading it?
(I only have a little knowledge about NFA, DFA and DFA minimization from the Dragon Book.)

I suggest the following approach.

Step 1 - understanding the problem

First you should understand the basic concepts (I provided some Wikipedia links, but all these things are well-known and easy to google, so don't hesitate to follow the links on Wikipedia to other articles, papers and books):

I suggest that you manually go through a series of examples with pencil and paper: start with some regular expression, convert it to a Thompson's NFA, simulate the NFA on some strings, then convert it to a DFA, then execute DFA on some strings.

Second, ask yourself a question: what if I want to mark some point in a regular expression (add a tag) and find out where it is in the matched string? Revisit your examples, but now add some tags in the regular expression, translate them to tagged epsilon-transitions in the NFA, and attempt to do tagged NFA simulation and determinization. Think about the changes that are needed in these algorithms in order to track tag values. Start with easy non-ambiguous examples like a* @t b* where @t is a tag.

You may find these slides with examples useful.
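As one possible way to work through the a* @t b* example, here is a hand-written sketch of a two-state automaton that records the tag position while matching. It is only an illustration of the idea, not code generated by re2c and not the algorithm from the paper; match_tagged is a hypothetical name.

```c
#include <stddef.h>

/* Matches the whole string s against a* @t b*. On success returns 1 and
   stores in *tag the offset where the a-part ends and the b-part begins.
   Returns 0 if s does not match. */
int match_tagged(const char *s, size_t *tag)
{
    enum { IN_A, IN_B } state = IN_A;
    size_t t = 0;
    size_t i;

    for (i = 0; s[i] != '\0'; i++) {
        char c = s[i];
        if (state == IN_A && c == 'a') {
            /* Still inside a*, nothing to record yet. */
        } else if (c == 'b') {
            if (state == IN_A) {
                t = i;            /* tag sits just before the first 'b' */
            }
            state = IN_B;
        } else {
            return 0;             /* 'a' after 'b', or any other symbol */
        }
    }
    if (state == IN_A) {
        t = i;                    /* no 'b' at all: tag is at end of input */
    }
    *tag = t;
    return 1;
}
```

This case is unambiguous, so a single variable is enough to hold the tag. Repeating the exercise with an ambiguous expression shows why more bookkeeping becomes necessary.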

Think about ambiguity in regular expressions, when it is possible to match in more than one way. Consider the following ambiguous cases with POSIX-style capturing parentheses:

  • ambiguous concatenation, e.g. (a|ab)(bc|c) matched against "abc" can result in "ab", "c" or in "a", "bc"
  • ambiguous alternative, e.g. (a*)|(a) matched against "a" can result in "a", "" or in "", "a"
  • ambiguous iteration, e.g. (a*)* matched against "aa" can result in "a", "a" or in "aa"
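The first case can be checked mechanically. The sketch below enumerates every way to split an input between the two groups of (a|ab)(bc|c) by brute force; it is not a regular expression engine, and count_splits is a hypothetical helper written for this example.

```c
#include <stdio.h>
#include <string.h>

/* Counts the ways input can be written as x followed by y, where x is one
   of the alternatives of the first group and y is one of the alternatives
   of the second group. Prints each split it finds. */
int count_splits(const char *input)
{
    static const char *first[] = { "a", "ab" };
    static const char *second[] = { "bc", "c" };
    int count = 0;

    for (int i = 0; i < 2; i++) {
        size_t n = strlen(first[i]);
        if (strncmp(input, first[i], n) != 0) {
            continue;             /* input does not start with this x */
        }
        for (int j = 0; j < 2; j++) {
            if (strcmp(input + n, second[j]) == 0) {
                printf("group 1 = \"%s\", group 2 = \"%s\"\n",
                       first[i], second[j]);
                count++;
            }
        }
    }
    return count;
}
```

Calling count_splits("abc") finds two splits, which is exactly the ambiguity a disambiguation policy has to resolve.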

Step 2 - reading the papers

I think it may not be an easy thing for me, but I will try, as I find TDFA interesting.

Thank you for your answer.