skvadrik / re2c

Lexer generator for C, C++, Go and Rust.

Home Page:https://re2c.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

mismatched YYBACKUP and YYRESTORE (again)

mejedi opened this issue · comments

The generated state machine invokes YYRESTORE without calling YYBACKUP first (for certain inputs). It results in a subsequent crash since we are restoring with a bogus state.

Please let me know if the below assumptions are wrong

  • YYBACKUP must be executed at least once before YYRESTORE
  • $ rule fires when EOF is reached AND state machine is not in a final state AND could not reach final state via YYRESTORE.

Here's the reproducer (I'm on re2c 3.1):

/*!re2c

  re2c:eof = 0;
  re2c:yyfill:enable = 0;

  [A>]* ">" {}
  $ {}

*/

Please find the first 20 lines of the output below:

$ re2go repro.re  --no-generation-date -i  -o /dev/stdout | head -n 20
// Code generated by re2c 3.1, DO NOT EDIT.

{
	var yych YYCTYPE
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy2
	case 'A':
		goto yy4
	default:
		if (YYLESSTHAN) {
			goto yy5
		}
		goto yy1
	}
yy1:
	YYRESTORE
	goto yy3
yy2:

Hi Nick, thanks for reporting. There is a catch here: you should define default rule *, otherwise control flow is undefined on some paths (and these paths happen to be the ones that do YYRESTORE without a previous YYBACKUP).

With -W re2c warns about this:

$ re2go repro.re -W -i
repro.re:1:0: warning: control flow is undefined for strings that match
	'\x41'
	'[\x0-\x3D\x3F-\x40\x42-\xFF]'
, use default rule '*' [-Wundefined-control-flow]
// Code generated by re2c 3.1 on Wed May 15 22:29:11 2024, DO NOT EDIT.

{
	var yych YYCTYPE
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy2
	case 'A':
		goto yy4
	default:
		if (YYLESSTHAN) {
			goto yy5
		}
		goto yy1
	}
yy1:
	YYRESTORE
	goto yy3
yy2:
	YYSKIP
	YYBACKUP
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy2
	case 'A':
		goto yy4
	default:
		goto yy3
	}
yy3:
	{}
yy4:
	YYSKIP
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy2
	case 'A':
		goto yy4
	default:
		goto yy1
	}
yy5:
	{ $ }
}

If you add default rule *, the warning is gone, and the generated code is correct:

$ cat fixed.re
/*!re2c

  re2c:eof = 0;
  re2c:yyfill:enable = 0;

  [A>]* ">" {}
  $ { $ }
  * { * }

*/
$ re2go fixed.re -W -i
// Code generated by re2c 3.1 on Wed May 15 22:27:22 2024, DO NOT EDIT.

{
	var yych YYCTYPE
	yyaccept := 0
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy3
	case 'A':
		goto yy5
	default:
		if (YYLESSTHAN) {
			goto yy8
		}
		goto yy1
	}
yy1:
	YYSKIP
yy2:
	{ * }
yy3:
	yyaccept = 0
	YYSKIP
	YYBACKUP
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy3
	case 'A':
		goto yy6
	default:
		goto yy4
	}
yy4:
	{}
yy5:
	yyaccept = 1
	YYSKIP
	YYBACKUP
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy3
	case 'A':
		goto yy6
	default:
		goto yy2
	}
yy6:
	YYSKIP
	yych = YYPEEK
	switch (yych) {
	case '>':
		goto yy3
	case 'A':
		goto yy6
	default:
		goto yy7
	}
yy7:
	YYRESTORE
	if (yyaccept == 0) {
		goto yy4
	} else {
		goto yy2
	}
yy8:
	{ $ }
}

Please let me know if the below assumptions are wrong

  • YYBACKUP must be executed at least once before YYRESTORE
  • $ rule fires when EOF is reached AND state machine is not in a final state AND could not reach final state via YYRESTORE.

The first assumption is correct. The second one is not: $ rule fires in the initial state (and takes priority over any other rule accepting empty string). It does not fire in any state where lexer reaches the end of input - those cases are covered by the default rule *. This is because the $ rule means "the lexer has reached the end of input" not "the lexer has consumed whatever garbage happened to be there and then reached the end of input". If you look at the generated code (for both repro.re and fixed.re) you'll see that there's only one transition from the initial state to the $ rule, and it is guarded by YYLESSTHAN check. I hope this makes sense.

mismatched YYBACKUP and YYRESTORE (again)

Again - are you referring to some other bug in the past? Can you link it here?

@skvadrik thank you for the insightful response. A hint concerning -w switch is especially useful.

I made a mistake when reducing the test case. Please find a slightly bigger sample below

/*!re2c

  re2c:eof = 0;
  re2c:yyfill:enable = 0;

  .* ">" { good }
  [^\n>]* "\n" { unterm_nl }
  $ { unterm_eof }

*/

The intention is to match the closing > delimiter. If there are multiple found before the newline, use the rightmost one. This time, -w doesn't produce a warning.

The output is quite unwieldy, a file is attached. It looks like "A" is processed as follows

  • yych <= '\n' is false, execute else branch in if statement on line 6
  • fall through and reach yy1: since the character is not >
  • current character is now 0 (a sentinel)
  • yych <= '\n' is true, execute then branch in if on line 24
  • both yych <= 0 and YYLESSTHAN are true, hence goto yy6
  • YYRESTORE

Again - are you referring to some other bug in the past? Can you link it here?

#284

The intention is to match the closing > delimiter. If there are multiple found before the newline, use the rightmost one.

The rule .* ">" is greedy, so the generated lexer always attempts to match more input: it matches any > as part of .*. Normally one would explicitly exclude the terminator: [^\n>]* ">" or, if you want to allow escaped >, then ([^\n>] | "\\>")* ">". But in your case, if you want to allow unescaped >, you need something like ([^>\x00]* [>])+.

See this small example for demonstration:

#include <assert.h>

int lex(const char *s) {
    const char *YYCURSOR = s, *YYMARKER;
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = char;

        ([^>\x00]* [>])+ { return YYCURSOR - s; }
        *                { return -1; }
    */
}

int main() {
    assert(lex("aaaabb") == -1);
    assert(lex("aa>aabb") == 3);
    assert(lex("aa>aa>bb") == 6);
    assert(lex(">>>>") == 4);
    return 0;
}

The same can be done with the $ rule (there's no principal difference here, the example is a bit bigger, but posting for completeness):

#include <assert.h>
#include <string.h>

int lex(const char *s) {
    const char *YYCURSOR = s, *YYMARKER, *YYLIMIT = s + strlen(s);
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:eof = 0;
        re2c:define:YYCTYPE = char;

        ([^>\x00]* [>])+ { return YYCURSOR - s; }
        *                { return -1; }
        $                { return -2; }
    */
}

int main() {
    assert(lex("") == -2);
    assert(lex("aaaabb") == -1);
    assert(lex("aa>aabb") == 3);
    assert(lex("aa>aa>bb") == 6);
    assert(lex(">>>>") == 4);
    return 0;
}

It would be better if re2c warned you about unintentional greedy iteration, but it can only do so if the iteration never stops, and here it does stop on \n.

This time, -w doesn't produce a warning.

Are you sure? It does for me:

warning: control flow is undefined for strings that match '[\x0-\x9\xB-\x3D\x3F-\xFF]', use default rule '*' [-Wundefined-control-flow]

@skvadrik
Thank you for your help; I think I was finally able to fix my misconceptions. Apparently getting the following bits is crucial to use re2c successfully. It would be nice if they were emphasised more prominently in the docs:

  1. automata must handle any possible input; failing to do so invites nasal demons (undefined behaviour);
  2. for this reason, always run with warnings on;
  3. EOF rule is matched if there was no input consumed so far;
  4. default rule consumes a single character and there's no way to tell how much progress was made (in terms of characters) before failing.

Playing with the sample you helpfully provided:

#include <assert.h>
#include <string.h>

int lex(const char *s) {
    const char *YYCURSOR = s, *YYMARKER, *YYLIMIT = s + strlen(s);
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:eof = 0;
        re2c:define:YYCTYPE = char;

        .* ">" { return YYCURSOR - s; }
        *      { return s - YYCURSOR; /* negated token length */ }
        $      { return -1000; }
    */
}

int main() {
    assert(lex("") == -1000); // <-- EOF rule
    assert(lex("1") == -1); // <-- default rule
    assert(lex("1\n") == -1); // <-- default rule
    assert(lex("12") == -1); // <-- default rule; I wish it returned -2
    assert(lex(">1") == 1);
    assert(lex(">1\n") == 1);
    assert(lex(">1>") == 3);
    return 0;
}

This time, -w doesn't produce a warning.

Are you sure? It does for me:

My bad, there was the warning.


Sorry for the noise, and thanks for the help.

It would be nice if they were emphasised more prominently in the docs

  1. automata must handle any possible input; failing to do so invites nasal demons (undefined behaviour);
  2. for this reason, always run with warnings on;
  3. EOF rule is matched if there was no input consumed so far;
  4. default rule consumes a single character and there's no way to tell how much progress was made (in terms of characters) before failing.

There are some docs explaining these things, although they may be not easy to find. Linking here in case someone will be reading this while debugging similar issues:

As for 2, it would be good to set -Wundefined-control-flow to be an error by default, but we can't do this for C as re2c is an old tool and there's a lot of old code without default rule * (previously [^] was used as a replacement for *, but it's not quite the same). Maybe for other languages it should become an error. Note that there is a -Werror option (and -Werror-undefined-control-flow in particular).

@skvadrik I have started to think that some of the warnings should probably be default on instead of default off. It is hard to get people to notice everything in the documentation, but error messages are prominently self documenting.

@skvadrik I have started to think that some of the warnings should probably be default on instead of default off. It is hard to get people to notice everything in the documentation, but error messages are prominently self documenting.

For all languages except C, I agree; for C we need to do some investigation first and see if php / ninja / other old users of re2c that we know of will be broken by this - if they are, we should reach out and fix them first.

Anyway, I agree something should be done about this, as it keeps causing people a lot of debugging for no good reason.

If re2c provides the ability to turn off the warning, and maybe even prints out how to do it in the default warning message, it will be easy for downstream packagers of older software that uses re2c to notice what the fix is. I agree, though, that php and ninja and other well known users should be informed if they get broken.

I'll reopen this issue as a reminder to do something about it.

If re2c provides the ability to turn off the warning

There's been -Wno-* and -Wno-error-* options since the addition of warnings, and re2c prints the category (e.g. [-Wundefined-control-flow]) at the end of the warning/error message.

Then changing the default should be pretty safe.