GoogleCloudPlatform / guest-agent


Improve handling of large process output when running metadata-script

mattdelco opened this issue · comments

Things don't seem to be going well in the handling of large process output, either because of or in spite of pull #140. In an instance log I see:

google_metadata_script_runner[980]: error while communicating with "startup-script-url" script: bufio.Scanner: token too long
google_metadata_script_runner[980]: 2021/12/22 18:51:17 logging client: rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
google_metadata_script_runner[980]: startup-script-url signal: broken pipe

With pull #140 the pipe is now closed, but closing the pipe means a subsequent write() by the child can fail (in my experimentation it actually raises a signal and kills the process, rather than just returning an error). I'm not sure it's intentional to do something that will likely kill or crash the process. Perhaps the pipe should instead be left open so the process can keep running, though to be more robust guest-agent would still need to keep reading from the pipe so the child process doesn't block if the pipe eventually fills up.

A simple fix that might help reduce the "token too long" issue is to increase the amount of buffering, e.g.:

in.Buffer(make([]byte, 0, 4 * 1024), 2 * 1024 * 1024)

The context for this is that I'm running cloud builder, which seems to use this functionality to run the build in an instance and gets hung up on this issue. The particular command being run is "gsutil -m cp -r $COMPONENTS ." to copy ~6 files, and I presume something went wrong with gsutil that caused it to generate enough text output to overwhelm what the text scanner can handle.

I have observed this while using the standard startup-script approach (rather than URL):

Oct 23 10:26:57 myhostname google_metadata_script_runner[1495]: error while communicating with "startup-script" script: bufio.Scanner: token too long
Oct 23 10:26:57 myhostname google_metadata_script_runner[1495]: startup-script signal: broken pipe
Oct 23 10:26:57 myhostname google_metadata_script_runner[1495]: Finished running startup scripts.

Hi @tpdownes, if this issue is currently affecting you, can you share details:

  1. when did you see the issue and what triggered it
  2. what version of the google-guest-agent package do you have installed

If we can show this is a current, reproducible issue, it will make it easier for us to determine a possible fix. The issue the OP is referencing was one where large output could cause a script to hang, which was resolved.

Discussed with tpdownes offline; this issue is related to startup script output with very long lines. We read the output from the script line-wise in order to facilitate multiplexing log output to system logging and cloud logging. That means we need to set some reasonable upper bound on the length of a single line or invest in a more complex arbitrary batching of lines. Since this is very rarely reported (now 2x in a year) we are not prioritizing that, and instead encourage script users to work around this by redirecting or modifying output of their scripts when they are very large. We will add some documentation about this on this repository.

The builds I have for cloud builder got hit by this again. Basically 'gsutil cp' emits progress reports that are terminated only by \r, and Go's default ScanLines implementation doesn't recognize that as a line ending (it only recognizes \n, though it will strip a \r that immediately precedes the \n). To handle this, there'd need to be a ScanLines variant that can handle \r, e.g.:

// dropCR drops a terminal \r from the data (same helper bufio's own
// ScanLines uses internally).
func dropCR(data []byte) []byte {
    if len(data) > 0 && data[len(data)-1] == '\r' {
        return data[:len(data)-1]
    }
    return data
}

func ScanLines2(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.IndexByte(data, '\n'); i >= 0 {
        // We have a full newline-terminated line.
        return i + 1, dropCR(data[0:i]), nil
    }
    // If we found a carriage return and it's not the last character seen so
    // far (we need to allow for a subsequent newline to follow in the stream).
    if j := bytes.IndexByte(data, '\r'); j >= 0 && j != len(data)-1 {
        // We have a full carriage-return-terminated line.
        return j + 1, data[0:j], nil
    }
    // If we're at EOF, we have a final, non-terminated line. Return it.
    if atEOF {
        return len(data), dropCR(data), nil
    }
    // Request more data.
    return 0, nil, nil
}

and then google_metadata_script_runner's runCmd() would need to call "in.Split(ScanLines2)" after it calls "in := bufio.NewScanner(pr)".

Regardless of whether the above is implemented, it'd help for runCmd() to not call "pr.Close()" after the for loop. That close basically amounts to sending a signal to the child process to kill it (assuming the child writes at least one more character to stdout after the close, which it probably will), and it's probably already covered by the "defer pr.Close()" at the top of runCmd(). It'd probably also help to call "pr.Read(make([]byte, bufio.MaxScanTokenSize))" after the line that logs "error while communicating with" so that the pipe gets emptied -- if the pipe fills up, the child process's next write to stdout will block, so emptying the pipe will let the child spew another 64KB before it locks up.