skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page: https://re2c.org

Generated code is too large when many stags are used

true-grue opened this issue

The code that re2c generates when many stags are used is surprisingly large.

Here is a simple test (a fragment, not a complete C++ program):

void test(std::string &text) {
    const char *YYCURSOR = text.data();
    const char *YYMARKER;
    /*!stags:re2c format = 'const char *@@;'; */
    const char *a1, *a2, *b1, *b2, *c1, *c2, *d1, *d2;
    const char *e1, *e2, *f1, *f2, *g1, *g2, *h1, *h2;
    for (;;) {
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;
        re2c:tags = 1;
        sp = [ \t]*;
        int = [0-9]+;

        "L" sp @a1 int @a2
            sp @b1 int @b2
            sp @c1 int @c2
            sp @d1 int @d2
            sp @e1 int @e2
            sp @f1 int @f2
            sp @g1 int @g2
            sp @h1 int @h2 {
            continue;
        }

        * { return; }
    */
    }
}

The file that re2c generates from this tiny input is at least 200 KB!

Is there a way to fix this, or is it a consequence of re2c using the direct-code method instead of a table-based one?

UPDATE: In fact, this is a pathological example. To fix it, one needs to rewrite sp as follows:
sp = [ \t]+;

Still, it would be great to make the generated code more compact. Maybe it makes sense to try to automatically factor identical parts out into standalone functions. That looks like a good research topic!

This happens because your sp token allows zero length. This essentially means that you can have eight int tokens not separated by anything, but with tags in between, which makes the tags very ambiguous. That's why the generated file has many states and tons of tag operations. Change sp to [ \t]+ (except perhaps for the first occurrence after "L") and the output is much smaller, because all ambiguity is removed.
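To get a feel for the ambiguity, here is a minimal sketch that counts how many ways an input can be split into the `sp int` groups of the rule above. It is not how re2c works internally; `count_parses` and `count_matches` are hypothetical helpers written only for this illustration, and the sketch requires the whole string to match. re2c still settles on a single parse, but the more candidate tag positions there are, the more the automaton has to keep apart.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

static bool is_sp(char c) { return c == ' ' || c == '\t'; }
static bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Counts the ways s[pos..end] can be split into exactly `groups` repetitions
// of `sp int`, where sp is at least `min_sp` blanks and int is one or more
// digits. min_sp = 0 models sp = [ \t]*, min_sp = 1 models sp = [ \t]+.
std::uint64_t count_parses(const std::string &s, std::size_t pos,
                           int groups, std::size_t min_sp) {
    if (groups == 0) return pos == s.size() ? 1 : 0;

    // Blanks and digits are disjoint, so sp has to take the whole run of
    // blanks: otherwise int would have to start with a blank.
    std::size_t sp_end = pos;
    while (sp_end < s.size() && is_sp(s[sp_end])) ++sp_end;
    if (sp_end - pos < min_sp) return 0;

    // Try every possible end of int (index of its last digit).
    std::uint64_t total = 0;
    for (std::size_t int_end = sp_end;
         int_end < s.size() && is_digit(s[int_end]); ++int_end) {
        total += count_parses(s, int_end + 1, groups - 1, min_sp);
    }
    return total;
}

// The rule from the issue: "L" followed by eight `sp int` groups.
std::uint64_t count_matches(const std::string &text, std::size_t min_sp) {
    if (text.empty() || text[0] != 'L') return 0;
    return count_parses(text, 1, 8, min_sp);
}
```

For example, `count_matches("L1234567890123456", 0)` gives 6435, the number of ways to cut 16 digits into 8 non-empty runs. With `min_sp = 1` the same input has no parse at all, and a properly separated input such as `"L 12 34 56 78 90 12 34 56"` has exactly one.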

As for direct-encoded vs. table-based, it is not relevant here: what matters is the number of states. In the ambiguous case the generated DFA has 388 states, and in the non-ambiguous case only 21. A large number of states would require large table sizes as well.
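As a rough back-of-the-envelope check of the table-size point, the sketch below assumes a dense, uncompressed transition table with one row per state, 256 columns (one per `char` value) and 2-byte entries. These are assumptions made for illustration, not the layout of any particular table-based generator, and the figure ignores tag operations entirely. `dense_table_bytes` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>

// Size in bytes of a dense transition table: states x symbols x entry size.
// The defaults (256 symbols, 16-bit entries) are assumptions for this sketch.
constexpr std::size_t dense_table_bytes(
        std::size_t states,
        std::size_t symbols = 256,
        std::size_t entry_bytes = sizeof(std::uint16_t)) {
    return states * symbols * entry_bytes;
}
```

Under these assumptions, 388 states come to 198,656 bytes and 21 states to 10,752 bytes, so the table shrinks in step with the state count, just as the direct-coded output does.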

@skvadrik It looks like, for some technical reason, you saw only my initial message, from before I edited it.

Anyway, thank you!