skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page:https://re2c.org

Language-specific frontends

DemiMarie opened this issue

The frontend is currently not aware of the specific programming language. This is a significant problem, as it can cause code to be misparsed. I don’t have any examples yet, though.

This is a valid concern, especially if we are going to support more language backends.

There is one example I have encountered where the parser has to be aware of the language: unpaired single quotes in Rust. Normally re2c parses a single quote as the beginning of a literal and looks for a matching quote to end the literal. Currently the parser has a bit of Rust-specific code to deal with it.
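
To make the Rust case concrete, here is a minimal sketch of how a scanner might tell a character literal ('a') apart from a lifetime or label ('a). This is not re2c's actual code; the function name `starts_rust_char_literal` is hypothetical, and the heuristic assumes UTF-8 input and ignores edge cases such as byte literals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not taken from re2c): decides whether the single quote
// at s[pos] opens a Rust character literal rather than a lifetime or a label.
bool starts_rust_char_literal(const std::string& s, std::size_t pos) {
    assert(pos < s.size() && s[pos] == '\'');
    std::size_t i = pos + 1;
    if (i >= s.size()) return false;
    if (s[i] == '\\') return true;   // escapes such as '\n' only occur in literals
    if (s[i] == '\'') return false;  // '' is not a valid character literal
    ++i;                             // skip the first byte of the character
    // Skip UTF-8 continuation bytes (10xxxxxx) of a multi-byte character.
    while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) ++i;
    // A literal has its closing quote right after exactly one character.
    return i < s.size() && s[i] == '\'';
}
```

With this heuristic, the quote in `&'a str` or `'static` is not treated as the start of a literal, so the scanner does not go looking for a matching quote that never comes.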

Full parsing of all supported languages is out of the question; this is not feasible, and it would make re2c unnecessarily complicated. There may be a need for more language-specific support in the parser, for example awareness of all kinds of string literals allowed in some language.

If you have other ideas or problematic examples, you are welcome to share them.

Some of the cases that come to mind:

  1. C++/Rust/Go raw string literals

  2. Certain C preprocessor directives should be treated as comments (#pragma, #error, #warning come to mind)

  3. Comment nesting (IIRC Rust comments nest, while C and C++’s comments definitely do not.)

  4. C preprocessor macro abuse

  5. C line continuation:

    //       \
     this is still commented
    /\
    * this is also a comment */
    "\\
    "is still in the string literal"
    R"ab\
    c(a raw string literal)abc"
  6. C trigraphs (yuck)
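
To illustrate item 3, the sketch below skips a block comment with or without nesting, depending on a flag that a language-aware frontend could set (nesting for Rust, no nesting for C and C++). The function name and signature are hypothetical and only meant to show how small the language-specific difference is.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: s[pos] must start a "/*" comment. Returns the offset
// just past the closing "*/", or npos if the comment is unterminated.
std::size_t skip_block_comment(const std::string& s, std::size_t pos, bool nesting) {
    assert(s.compare(pos, 2, "/*") == 0);
    std::size_t depth = 1;
    std::size_t i = pos + 2;
    while (i + 1 < s.size()) {
        if (s[i] == '*' && s[i + 1] == '/') {
            i += 2;
            if (--depth == 0) return i;  // outermost comment closed
        } else if (nesting && s[i] == '/' && s[i + 1] == '*') {
            i += 2;
            ++depth;                     // Rust-style nested comment opens
        } else {
            ++i;
        }
    }
    return std::string::npos;            // ran out of input inside the comment
}
```

On the input `/* a /* b */ c */ x`, the non-nesting mode stops after the first `*/`, while the nesting mode stops after the second one.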

For 5 and 6, I suggest treating any occurrences of the bad cases (continued line comment, escaped newline in block comment delimiter, escaped newline after backslash in string or char literal, trigraph that could impact parsing) as syntax errors. They are all considered bad practice anyway (to the point that compilers issue warnings about them), so rejecting them should be okay.
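
A rough sketch of two such checks follows. It is coarser than the proposal above: it flags every trigraph rather than only those that could affect parsing, and the line-comment check assumes that `//` does not occur inside a string literal on the same line. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of the first C trigraph in s
// ("??" followed by one of = / ' ( ) ! < > -), or npos if there is none.
std::size_t find_trigraph(const std::string& s) {
    static const std::string third = "=/'()!<>-";
    for (std::size_t i = 0; i + 2 < s.size(); ++i) {
        if (s[i] == '?' && s[i + 1] == '?' &&
            third.find(s[i + 2]) != std::string::npos) {
            return i;
        }
    }
    return std::string::npos;
}

// Hypothetical helper: true if `line` (given without its trailing newline)
// contains "//" and ends with a backslash, so the comment would continue
// onto the next line. Assumes "//" is not part of a string literal.
bool is_continued_line_comment(const std::string& line) {
    return !line.empty() && line.back() == '\\' &&
           line.find("//") != std::string::npos;
}
```

A frontend could run checks like these and report an error (or a warning) instead of trying to interpret the construct.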

Another problematic case that came to mind is numeric literals that use a single quote as a digit separator (12'345, allowed since C++14).
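
One possible heuristic for this case, sketched below under the assumption that the scanner can look backwards in its input: a quote is a digit separator if it sits between two hexadecimal digits and the token it belongs to starts with a digit. The name `is_digit_separator` is hypothetical, and the check deliberately errs on the side of answering "no" (for example, it does not handle literals that start with a dot).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: decides whether the single quote at s[pos] is a
// C++14 digit separator (as in 12'345) rather than the start of a literal.
bool is_digit_separator(const std::string& s, std::size_t pos) {
    assert(pos < s.size() && s[pos] == '\'');
    auto is_xdigit = [](char c) { return std::isxdigit(static_cast<unsigned char>(c)) != 0; };
    auto is_word = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_' || c == '\'';
    };
    if (pos == 0 || pos + 1 >= s.size()) return false;
    if (!is_xdigit(s[pos - 1]) || !is_xdigit(s[pos + 1])) return false;
    // Walk back to the start of the token that contains the quote.
    std::size_t start = pos;
    while (start > 0 && is_word(s[start - 1])) --start;
    // Numeric literals start with a digit; prefixes such as u8 do not.
    return std::isdigit(static_cast<unsigned char>(s[start])) != 0;
}
```

This accepts the quotes in `12'345` and `0xFF'FF`, but not the opening quote in `u8'a'` or `c = 'a';`.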

It should be noted that re2c handles code outside of blocks differently from the code inside of blocks (that is, user-defined semantic actions). Although semantic actions are not parsed precisely, re2c is able to recognize comments, strings, etc., as it searches for the closing curly brace. But the code between blocks is treated more or less like a stream of raw characters.

Any effort to change this should be conservative, meaning that if re2c is unable to recognize a precise lexeme (e.g. a string, a preprocessor directive, etc.) then it should fall back to the "raw stream of characters" logic.
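
The idea can be sketched as a brace matcher that skips strings, character literals and comments, and gives up as soon as it meets something it cannot recognize. This is an illustration only, not re2c's implementation: the function name is hypothetical, it only knows C-like lexemes, and it has no notion of raw strings or digit separators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: s[open] must be '{'. Returns the offset of the
// matching '}', or npos if no match is found or a lexeme is unterminated.
// On npos the caller would fall back to treating the input as raw characters.
std::size_t find_closing_brace(const std::string& s, std::size_t open) {
    assert(open < s.size() && s[open] == '{');
    const std::size_t npos = std::string::npos;
    std::size_t depth = 0;
    for (std::size_t i = open; i < s.size(); ++i) {
        const char c = s[i];
        if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (--depth == 0) return i;
        } else if (c == '"' || c == '\'') {
            // Skip a string or character literal, honoring backslash escapes.
            std::size_t j = i + 1;
            while (j < s.size() && s[j] != c && s[j] != '\n') {
                if (s[j] == '\\') ++j;
                ++j;
            }
            if (j >= s.size() || s[j] != c) return npos;  // unterminated literal
            i = j;
        } else if (c == '/' && i + 1 < s.size() && s[i + 1] == '/') {
            i = s.find('\n', i);                          // skip a line comment
            if (i == npos) return npos;
        } else if (c == '/' && i + 1 < s.size() && s[i + 1] == '*') {
            const std::size_t end = s.find("*/", i + 2);  // skip a block comment
            if (end == npos) return npos;
            i = end + 1;
        }
    }
    return npos;
}
```

Returning npos rather than guessing keeps the behavior conservative: an unrecognized construct never causes a wrong brace to be picked silently.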

I recommend issuing a warning in this case. BTW, if I recall correctly, certain cases (such as unterminated string literals) are undefined behavior (!). re2c can just reject those outright.

One other thought I had is to actually pipe the output of re2c through the C preprocessor, then inspect the preprocessor’s output to make sure that what re2c thought were balanced { and } actually were. I suspect this would only be viable with build system integration, or on *nix where the syntax for invoking the C compiler is mostly standardized.