Non-greedy operator

scossu opened this issue 2 years ago

I have the following regular expression chain:

/*!re2c
    HEX             = [\x30-\x39\x41-\x46];
    CHAR_BASE       = "\\u" HEX{4} | "\\U" HEX{8} | '\\' | [\U0000005D-\U0010FFFF];
    CHARACTER       = CHAR_BASE | [\x20-\x5B];
    ECHAR           = CHARACTER | ([\\] [tnr]);
    LCHAR           = ECHAR | ([\\] ["]) | [\t\n\r];
    LSTRING         = [\x22]{3} LCHAR* [\x22]{3};

    LSTRING { /* do something */ }
*/

This is meant to match a Turtle long string which is enclosed in triple double quotes and may contain individual double quotes.

I have tried to keep the syntax as close to the spec as possible, but it's not working as expected. For example, it matches

"""This is a string."""
"""This is another string."""

as one token.

This is probably because LCHAR also matches a single double quote, so [\x22]{3} LCHAR* [\x22]{3} keeps consuming the quotes of an inner triple quote one at a time until it reaches the final triple quote of the second string.

Is it possible to specify a non-greedy operator, or work around that in some way?

Thanks.

Ulya Trofimovich · Answer 1 · Thu May 05 2022 14:48:11 GMT+0800 (China Standard Time)

Hi! You are right, it happens because of greediness. You can restructure the lexer to consume one LCHAR at a time in a loop; this way the terminating three quotes take precedence, because at each step re2c prefers the longest match. See the example below (I assumed null-terminated strings, so I added exclusion of null in the middle of a string as well, but this is unrelated).

#include <assert.h>
#include <stdio.h>

int lex(const char *s) {
    const char *YYCURSOR = s, *YYMARKER;
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = "unsigned char";
        re2c:encoding:utf8 = 1;

        HEX             = [\x30-\x39\x41-\x46];
        CHAR_BASE       = "\\u" HEX{4} | "\\U" HEX{8} | '\\' | [\U0000005D-\U0010FFFF];
        CHARACTER       = CHAR_BASE | [\x20-\x5B];
        ECHAR           = CHARACTER | ([\\] [tnr]) | [\x00];
        LCHAR           = ECHAR | ([\\] ["]) | [\t\n\r];
    */
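    int count = 0; /* number of complete long strings seen so far */
space:
    /* Between strings: expect an opening """, a newline, or end of input. */
    /*!re2c
        *         { return -1; }
        [\x22]{3} { goto lchar; }
        [\n]      { goto space; }
        [\x00]    { return count; }
    */
lchar:
    /* Inside a string: the three-character closing """ beats the shorter
       LCHAR match, so the first terminator ends the string. */
    /*!re2c
        *         { return -1; }
        [\x00]    { return -2; }
        [\x22]{3} { ++count; goto space; }
        LCHAR     { goto lchar; }
    */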
}

int main() {
    assert(lex("\"\"\"one\"\"\"") == 1);
    assert(lex("\"\"\"one\"\"\"\n\"\"\"two\"\"\"") == 2);
    assert(lex("\"\"\"one\"\"\"\n\"\"\"two\"\"\"\n\"\"\"th\\\"ree\"\"\"") == 3);
    assert(lex("\"\"\"unterminated\"\"") == -2);
    return 0;
}

I don't think it's possible to do what you want in one lexeme. I can imagine it if re2c supported a negation operator (which it doesn't), but even then the resulting automaton would be unnecessarily large, because the counted repetition of quotes would have to be unfolded.
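
The difference between the single greedy rule and the one-LCHAR-at-a-time loop can be modeled roughly in plain C. This is only a sketch, not re2c output: `greedy_len` and `loop_len` are made-up names, character-class validation is left out, and escapes are reduced to "a backslash plus the next byte".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Greedy model: LCHAR* also accepts quotes and newlines, so the match
   runs up to the LAST """ in the input. Returns 0 if there is no match. */
size_t greedy_len(const char *s) {
    assert(s != NULL);
    if (strncmp(s, "\"\"\"", 3) != 0) return 0;
    const char *last = NULL;
    for (const char *p = s + 3; (p = strstr(p, "\"\"\"")) != NULL; ++p)
        last = p;
    return last ? (size_t)(last - s) + 3 : 0;
}

/* Loop model: at every position test for """ before consuming a single
   character, so the FIRST terminator wins. Returns 0 if unterminated. */
size_t loop_len(const char *s) {
    assert(s != NULL);
    if (strncmp(s, "\"\"\"", 3) != 0) return 0;
    const char *p = s + 3;
    while (*p != '\0') {
        if (strncmp(p, "\"\"\"", 3) == 0) return (size_t)(p - s) + 3;
        if (*p == '\\' && p[1] != '\0') ++p; /* skip the escaped byte */
        ++p;
    }
    return 0;
}
```

On the two-line input from the question, `greedy_len` covers both strings and the newline between them, while `loop_len` stops at the end of the first string.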

Ulya Trofimovich · Answer 2 · Thu May 05 2022 15:22:31 GMT+0800 (China Standard Time)

By the way, you can also use start conditions to write multiple lexer blocks as one.
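
Conceptually, a start condition is a state variable that selects which rule set is active, so several blocks collapse into one loop. Below is a hand-written plain-C analogue of the two blocks from the earlier example. It is neither re2c syntax nor re2c output: `enum cond`, `COND_SPACE`, `COND_LCHAR` and `lex_conditions` are hypothetical names, and character-class validation inside the string is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical condition names mirroring the "space" and "lchar" blocks. */
enum cond { COND_SPACE, COND_LCHAR };

/* Counts long strings; returns -1 on unexpected input between strings
   and -2 on an unterminated string. */
int lex_conditions(const char *s) {
    assert(s != NULL);
    enum cond c = COND_SPACE;
    int count = 0;
    for (const char *p = s;;) {
        int triple = strncmp(p, "\"\"\"", 3) == 0;
        switch (c) {
        case COND_SPACE: /* between strings */
            if (triple)          { p += 3; c = COND_LCHAR; }
            else if (*p == '\n') { ++p; }
            else if (*p == '\0') { return count; }
            else                 { return -1; }
            break;
        case COND_LCHAR: /* inside a string */
            if (*p == '\0')      { return -2; }
            else if (triple)     { p += 3; ++count; c = COND_SPACE; }
            else if (*p == '\\' && p[1] == '"') { p += 2; } /* escaped quote */
            else                 { ++p; }
            break;
        }
    }
}
```

With re2c's start conditions, the generated code keeps track of the condition for you, and each rule names the condition it belongs to instead of living in its own block.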

Stefano Cossu · Answer 3 · Thu May 05 2022 23:05:47 GMT+0800 (China Standard Time)

Thanks! I will try to implement your solution and look into start conditions, which I haven't yet grasped completely.