skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page: https://re2c.org

Non-greedy operator

scossu opened this issue

I have the following regular expression chain:

/*!re2c
    HEX             = [\x30-\x39\x41-\x46];
    CHAR_BASE       = "\\u" HEX{4} | "\\U" HEX{8} | '\\' | [\U0000005D-\U0010FFFF];
    CHARACTER       = CHAR_BASE | [\x20-\x5B];
    ECHAR           = CHARACTER | ([\\] [tnr]);
    LCHAR           = ECHAR | ([\\] ["]) | [\t\n\r];
    LSTRING         = [\x22]{3} LCHAR* [\x22]{3};

    LSTRING { /* do something */ }
*/

This is meant to match a Turtle long string, which is enclosed in triple double quotes and may contain individual double quotes.

I have tried to keep the syntax as close to the spec as possible, but it's not working as expected. E.g. it matches

"""This is a string."""
"""This is another string."""

as one token.

This is probably because [\x22]{3} LCHAR* [\x22]{3} will keep eating up triple quotes as individual ones, until it finds the final triple quote from the second string.
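
To make the greedy behaviour concrete, here is a small plain-C sketch (hand-written, not re2c output). `longest_lstring` is a hypothetical helper, and it assumes that any non-null byte counts as an LCHAR, which is looser than the grammar above. It returns the length of the longest prefix that a single `""" LCHAR* """` rule could match.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the longest prefix of s that matches
   """ ANY* """, where ANY is any non-null byte (a simplification of LCHAR).
   Returns 0 if s does not match at all. */
size_t longest_lstring(const char *s) {
    size_t n = strlen(s), best = 0;
    if (n < 6 || strncmp(s, "\"\"\"", 3) != 0) return 0;
    /* Every """ that starts at offset >= 3 can close the match;
       longest-match semantics keeps the last one found. */
    for (size_t i = 3; i + 3 <= n; ++i) {
        if (strncmp(s + i, "\"\"\"", 3) == 0) best = i + 3;
    }
    return best;
}
```

For two short strings on consecutive lines, such as `"""a"""` followed by `"""b"""`, this returns the length of the whole input rather than the length of the first string, which mirrors the single-token result described above.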

Is it possible to specify a non-greedy operator, or to work around this in some other way?

Thanks.

Hi! You are right, it happens because of greediness. You can restructure the lexer to consume one LCHAR at a time in a loop. This way the terminating three quotes take precedence, because at each step the three-quote rule is a longer match than a single-quote LCHAR. See the example below (I assumed null-terminated strings, so I also added exclusion of null in the middle of a string, but this is unrelated).

#include <assert.h>
#include <stdio.h>

int lex(const char *s) {
    const char *YYCURSOR = s, *YYMARKER;
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = "unsigned char";
        re2c:encoding:utf8 = 1;

        HEX             = [\x30-\x39\x41-\x46];
        CHAR_BASE       = "\\u" HEX{4} | "\\U" HEX{8} | '\\' | [\U0000005D-\U0010FFFF];
        CHARACTER       = CHAR_BASE | [\x20-\x5B];
        ECHAR           = CHARACTER | ([\\] [tnr]) | [\x00];
        LCHAR           = ECHAR | ([\\] ["]) | [\t\n\r];
    */
    int count = 0;
space:
    /*!re2c
        *         { return -1; }
        [\x22]{3} { goto lchar; }
        [\n]      { goto space; }
        [\x00]    { return count; }
    */
lchar:
    /*!re2c
        *         { return -1; }
        [\x00]    { return -2; }
        [\x22]{3} { ++count; goto space; }
        LCHAR     { goto lchar; }
    */
}

int main() {
    assert(lex("\"\"\"one\"\"\"") == 1);
    assert(lex("\"\"\"one\"\"\"\n\"\"\"two\"\"\"") == 2);
    assert(lex("\"\"\"one\"\"\"\n\"\"\"two\"\"\"\n\"\"\"th\\\"ree\"\"\"") == 3);
    assert(lex("\"\"\"unterminated\"\"") == -2);
    return 0;
}
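
For readers who want to follow the control flow without running re2c, the same two-state loop can be sketched by hand in plain C. This is only an illustration under simplifying assumptions: `count_lstrings` is a hypothetical name, any non-null byte is accepted inside a string (the real LCHAR is stricter), and only `\"` is treated as a two-byte unit. The return codes follow lex() above.

```c
#include <string.h>

/* Hypothetical hand-written counterpart of lex(): counts """...""" strings
   separated by newlines. Returns -1 on unexpected input between strings
   and -2 on an unterminated string. */
int count_lstrings(const char *s) {
    int count = 0;
    for (;;) {
        /* "space" state: between strings */
        if (*s == '\0') return count;
        if (*s == '\n') { ++s; continue; }
        if (strncmp(s, "\"\"\"", 3) != 0) return -1;
        s += 3;
        /* "lchar" state: check for the closing """ before consuming
           a single character, so the terminator always wins */
        for (;;) {
            if (*s == '\0') return -2;
            if (strncmp(s, "\"\"\"", 3) == 0) { s += 3; ++count; break; }
            /* an escaped quote is consumed as one unit and therefore
               cannot start a closing delimiter */
            if (s[0] == '\\' && s[1] == '"') s += 2; else ++s;
        }
    }
}
```

Because the closing delimiter is tested at every step, the first `"""` after the opening one ends the string, which is the non-greedy behaviour asked for.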

I don't think it's possible to do what you want in one lexeme. I can imagine it if re2c supported a negation operator (which it doesn't), but even then the resulting automaton would be unnecessarily large, because the counted repetition of quotes would have to be unfolded.

By the way, you can also use start conditions to write multiple lexer blocks as one.

Thanks! I will try to implement your solution and look into start conditions, which I haven't fully grasped yet.