Flex segfaults after reading EOF in `input()`

Question

nxg opened this issue 6 months ago · comments

The program below works as expected when reading from stdin, but segfaults when it is instead lexing a buffer.

The key thing about this example is that one of the rules uses input() to gobble from "!" to EOF. (Yes, it looks as if I could use a "!".* pattern, but that doesn't produce the intended results in the real case: the lexer needs to balance braces, and if I hit EOF while doing that, I want to recover gracefully.)

When run, reading from stdin, I get

$ flex -o eof.c eof.lex
$ cc -o eof eof.c
$ echo -n 'one two!three four' | ./eof
word:<one>
-> 1
-> 2
word:<two>
-> 1
buf=<three four>
-> 3

That's fine, but when I instead run ./eof 'one two !three four', which scans the contents of a buffer set up by yy_scan_string, I get identical program output, followed by a segfault inside yy_get_next_buffer.

I can't work out which part of the flex manual is telling me I should expect that to happen.

The sequence of events seems to be that the lexer is finding its way to the end of file, as expected (and an <<EOF>> action confirms this), but not stopping there, despite the presence of the noyywrap option, and collapsing when it can't find a ‘next’ buffer.

Points:

- Option -d doesn't illuminate.
- It is, of course, a little hard to follow what the generated code is doing, but looking at the location of the segfault, it is indeed around the place where the code is checking for yywrap, so it should be getting the message that there is no more input coming.
- The only real illustration of using input() in the flex manual is a case where hitting EOF is reported as an error. Here, I'm doing essentially the same as in that example, but regarding EOF as an acceptable end of the scan.
- The same behaviour appears when using a reentrant scanner.
- It's worth noting that input() returns 0, not EOF, at EOF, despite what Sect. 8 illustrates (cf. flex repo issue, and links there), and despite the rather mysterious note about a ‘“real” end-of-file’ in Sect. 20. I have a suspicion that this remark in Sect. 20 is telling me something terribly important, but I can't work out what.
- This is with flex 2.6.4 and clang 15 on macOS, and 2.6.4 and gcc on (a RHEL-derived) Linux. (I can confirm the precise gcc version if that would be helpful, but this doesn't look obviously compiler dependent.)

Program:

ALPHABETIC  [a-zA-Z]
WS      [^a-zA-Z!]

%option noyywrap nounput

%%

{ALPHABETIC}+   {
    printf("word:<%s>\n", yytext);
    return 1;
}
{WS}+   {
    return 2;
}

"!"         {  // gobble to end of input
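    char buf[80];
    size_t idx = 0;
    int c;
    // input() returns 0 at EOF here; also stop on EOF or when buf is full
    while (idx < sizeof buf - 1 && (c = input()) != 0 && c != EOF)
        buf[idx++] = (char)c;
    buf[idx] = '\0';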
    printf("buf=<%s>\n", buf);
    // YY_FLUSH_BUFFER; /* makes no difference */
    return 3;
}

%%
int main(int argc, char** argv)
{
    switch (argc) {
      case 1: break;
      case 2:
        yy_scan_string(argv[1]);
        break;
      default:
        fprintf(stderr, "Usage: %s [string]\n", argv[0]);
        exit(1);
    }

    int token;
    while ((token = yylex()) != 0) {
        printf("-> %d\n", token);
    }
}

Joseph Langley · Answer 1 · Tue Mar 26 2024 01:43:34 GMT+0800 (China Standard Time)

I can't find a spot in the docs that explains this behavior clearly. The best hints I could find are in the sections on multiple buffers, yywrap, and EOF rules.

You need an <<EOF>> rule that calls yyterminate() or sets up the next buffer. That rule will take the place of yywrap in your use case.

I'm away from my computer but I'll post an example when I'm back.

Norman Gray · Answer 2 · Tue Mar 26 2024 02:17:47 GMT+0800 (China Standard Time)

Thanks for clarifying.

In case it's useful when thinking about the docs: my mental model, when writing what I did, was that when I arrive at EOF using input(), I'm doing so ‘legitimately’ (i.e., as opposed to being illegitimately creative with yyinput, or something like that). On that basis I presumed that yywrap would Do The Right Thing, and that when flex subsequently asked for more input from input(), it would be told calmly ‘no’.

Or, put another way, my mental model is that flex is itself using input() to get input, or something equivalent to that, so that I'm working in concert with it if I read from it separately.

If those are bad intuitions, it might be useful for the docs to disabuse the reader fairly explicitly.