Logs from execs/services can get dropped/cutoff
sipsma opened this issue · comments
Erik Sipsma commented
Repro test:
func TestLogDrop(t *testing.T) {
t.Parallel()
var logs safeBuffer
c, ctx := connect(t, dagger.WithLogOutput(&logs))
_, err := c.Container().
From(alpineImage).
WithEnvVariable("BUST", identity.NewID()).
WithExec([]string{"sh", "-c", `i=0; while [ $i -lt 2000 ]; do echo ` + strings.Repeat("$i", 100) + `; i=$((i + 1)); done`}).
Sync(ctx)
require.NoError(t, err)
require.NoError(t, c.Close())
scanner := bufio.NewScanner(&logs)
var i int
expectedLine := strings.Repeat(strconv.Itoa(i), 100)
for scanner.Scan() {
line := scanner.Text()
// TODO: just for debug
t.Log(line)
if strings.HasSuffix(line, expectedLine) {
i++
expectedLine = strings.Repeat(strconv.Itoa(i), 100)
}
}
require.NoError(t, scanner.Err())
require.Equal(t, 2000, i)
}
Never passes locally for me, will make it a few hundred expected lines in before missing one.
Ran into this problem IRL while debugging engine stuff w/ buildkit scheduler debug logs enabled: #7128
Only workaround was to just run a non-nested dev engine, which made the full logs available via docker logs
.
I wouldn't personally consider it a bug that we ever drop logs; having some sort of limit in some part of the telemetry pipeline makes perfect sense, however:
- I'm not actually sure that's what's happening here, as opposed to logs being missing due to some actual bug
- Even if that is what's being hit, it seems like we hit a lot earlier than ideal. The repro above is only writing 2000 lines, which isn't particularly crazy even if they are being written rapidly.
- The fact that
docker logs
handles it okay seems particularly telling. Obviously we have a lot more going on with our telemetry, but the fact that I can't get all the logs even just locally feels wrong.
- The fact that
- In that PR I linked to above, I also sometimes saw errors like
OpenTelemetry error: grpc received message larger than max
(approximately that, based on memory), which does feel like a genuine bug.- However, I am not seeing that anywhere in the repro above, so I don't know what to make of it. Maybe red herring or maybe the error itself gets dropped sometimes in these cases?
cc @vito
Justin Chadwell commented
Could this potentially be related to something like #7227?