Generated code is too large when many stags are used

Question

true-grue opened this issue 2 years ago

The code that re2c generates for a rule with many stags is surprisingly large.

Here is a simple test case (not a complete, working C++ program):

void test(std::string &text) {
    const char *YYCURSOR = text.data();
    const char *YYMARKER;
    /*!stags:re2c format = 'const char *@@;'; */
    const char *a1, *a2, *b1, *b2, *c1, *c2, *d1, *d2;
    const char *e1, *e2, *f1, *f2, *g1, *g2, *h1, *h2;
    for (;;) {
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;
        re2c:tags = 1;
        sp = [ \t]*;
        int = [0-9]+;

        "L" sp @a1 int @a2
            sp @b1 int @b2
            sp @c1 int @c2
            sp @d1 int @d2
            sp @e1 int @e2
            sp @f1 int @f2
            sp @g1 int @g2
            sp @h1 int @h2 {
            continue;
        }

        * { return; }
    */
    }
}

The file that re2c generates for this tiny input is at least 200 KB!

Is there a way to fix this, or is it a consequence of re2c using direct-coded output instead of a table-based one?

UPDATE. In fact, it's a pathological example, and to fix it one needs to rewrite sp as follows:
sp = [ \t]+;

Still, it would be great to make the generated code more compact. Maybe it makes sense to try to automatically factor identical parts out into standalone functions. Looks like a good research topic!

Ulya Trofimovich · Answer 1 · Sun Mar 27 2022 00:01:04 GMT+0800 (China Standard Time)

This happens because your sp token allows zero length. This essentially means that you can have ten int tokens not separated by anything, but with tags in between, which makes the tags very ambiguous. That's why the generated file has many states and tons of tag operations. Change sp to [ \t]+ (except perhaps for the first occurrence after "L") and the output is much smaller because all ambiguity is removed.
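
To get a feel for how much ambiguity a zero-length sp introduces, here is a minimal C++ sketch. It is not re2c output and does not model how re2c resolves tags internally; it only counts how many ways a run of digits can be cut into non-empty int tokens when nothing is required between them. The helper name count_splits is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Count the ways to cut a run of `digits` characters into `parts`
// non-empty pieces. Each piece stands for one `int` token when `sp`
// is allowed to match the empty string.
// Hypothetical helper, for illustration only.
std::uint64_t count_splits(unsigned digits, unsigned parts) {
    // ways[p][d] = number of ways to split d digits into p non-empty pieces
    std::vector<std::vector<std::uint64_t>> ways(
        parts + 1, std::vector<std::uint64_t>(digits + 1, 0));
    ways[0][0] = 1; // zero pieces cover zero digits in exactly one way
    for (unsigned p = 1; p <= parts; ++p)
        for (unsigned d = 1; d <= digits; ++d)
            for (unsigned len = 1; len <= d; ++len) // length of the last piece
                ways[p][d] += ways[p - 1][d - len];
    return ways[parts][digits];
}
```

With the eight int tokens from the test case, count_splits(8, 8) is 1, count_splits(10, 8) is 36, and count_splits(16, 8) is 6435. Each split corresponds to a different placement of the tags, which is the kind of ambiguity the automaton has to keep track of. Once sp is [ \t]+, the separators fix the token boundaries and only one split remains.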

As for direct-encoded vs. table-based, it is not relevant here: what matters is the number of states. In the ambiguous case the generated DFA has 388 states, and in the non-ambiguous case only 21. A large number of states would require large table sizes as well.
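
As a rough back-of-the-envelope check of the table-size remark, the sketch below estimates the size of a dense transition table for both state counts. The layout is an assumption made only for illustration (one entry per state and input byte, 2-byte state indices); it is not what re2c emits, and it ignores the extra data a table-based scanner would need for tag operations. The name dense_table_bytes is hypothetical.

```cpp
#include <cstddef>

// Bytes needed for a dense transition table: one entry per (state, symbol).
// All three parameters are assumptions for illustration, not re2c settings.
constexpr std::size_t dense_table_bytes(std::size_t states,
                                        std::size_t symbols,
                                        std::size_t entry_bytes) {
    return states * symbols * entry_bytes;
}

// Assumed: 8-bit input (256 symbols) and 2-byte state indices.
constexpr std::size_t ambiguous_bytes = dense_table_bytes(388, 256, 2);   // 198656
constexpr std::size_t unambiguous_bytes = dense_table_bytes(21, 256, 2);  // 10752
```

Under these assumptions the table shrinks by the same factor as the state count (388 / 21, roughly 18x), so removing the ambiguity pays off whichever code generation style is used.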

Peter Sovietov · Answer 2 · Sun Mar 27 2022 01:46:13 GMT+0800 (China Standard Time)

@skvadrik It looks like, for some technical reason, you saw only my initial message, before I edited it.

Anyway, thank you!