dagger / dagger

An engine to run your pipelines in containers

Home Page: https://dagger.io

Logs from execs/services can get dropped/cut off

sipsma opened this issue

Repro test:

// Note: this lives alongside dagger's integration tests; connect, safeBuffer,
// and alpineImage are helpers from that suite.
import (
	"bufio"
	"strconv"
	"strings"
	"testing"

	"dagger.io/dagger"
	"github.com/moby/buildkit/identity"
	"github.com/stretchr/testify/require"
)

func TestLogDrop(t *testing.T) {
	t.Parallel()

	// capture the client's log output in a buffer
	var logs safeBuffer
	c, ctx := connect(t, dagger.WithLogOutput(&logs))

	// echo 2000 lines, each one the line's index repeated 100 times
	_, err := c.Container().
		From(alpineImage).
		WithEnvVariable("BUST", identity.NewID()). // cache-bust so the exec actually runs
		WithExec([]string{"sh", "-c", `i=0; while [ $i -lt 2000 ]; do echo ` + strings.Repeat("$i", 100) + `; i=$((i + 1)); done`}).
		Sync(ctx)
	require.NoError(t, err)
	require.NoError(t, c.Close())

	// scan the captured logs, counting the expected lines in order
	scanner := bufio.NewScanner(&logs)
	var i int
	expectedLine := strings.Repeat(strconv.Itoa(i), 100)
	for scanner.Scan() {
		line := scanner.Text()

		// TODO: just for debug
		t.Log(line)

		if strings.HasSuffix(line, expectedLine) {
			i++
			expectedLine = strings.Repeat(strconv.Itoa(i), 100)
		}
	}
	require.NoError(t, scanner.Err())
	require.Equal(t, 2000, i)
}

This never passes locally for me; it gets a few hundred expected lines in before missing one.
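
For anyone digging further: a variant of the scan that records every index that never shows up, instead of stopping at the first gap, would show whether this is one contiguous cutoff or scattered holes. A minimal sketch, reusing the same harness, helpers, and imports as the repro above:

// Sketch only, same setup as TestLogDrop above; reports every missing
// index rather than failing at the first one.
func TestLogDropGaps(t *testing.T) {
	t.Parallel()

	var logs safeBuffer
	c, ctx := connect(t, dagger.WithLogOutput(&logs))

	_, err := c.Container().
		From(alpineImage).
		WithEnvVariable("BUST", identity.NewID()).
		WithExec([]string{"sh", "-c", `i=0; while [ $i -lt 2000 ]; do echo ` + strings.Repeat("$i", 100) + `; i=$((i + 1)); done`}).
		Sync(ctx)
	require.NoError(t, err)
	require.NoError(t, c.Close())

	// map each expected repeated-digit suffix back to its index
	want := make(map[string]int, 2000)
	for i := 0; i < 2000; i++ {
		want[strings.Repeat(strconv.Itoa(i), 100)] = i
	}

	seen := make(map[int]bool, 2000)
	scanner := bufio.NewScanner(&logs)
	for scanner.Scan() {
		line := scanner.Text()
		// expected suffixes are 100*d bytes for d-digit indices; check the
		// longest first so e.g. index 1111 isn't misattributed to index 1
		for _, l := range []int{400, 300, 200, 100} {
			if len(line) < l {
				continue
			}
			if n, ok := want[line[len(line)-l:]]; ok {
				seen[n] = true
				break
			}
		}
	}
	require.NoError(t, scanner.Err())

	var missing []int
	for i := 0; i < 2000; i++ {
		if !seen[i] {
			missing = append(missing, i)
		}
	}
	t.Logf("missing %d of 2000 lines: %v", len(missing), missing)
	require.Empty(t, missing)
}

Matching the longest candidate suffixes first keeps repeated-digit indices like 1, 11, 111, and 1111 from being attributed to one another.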

Ran into this problem in the wild while debugging engine internals with buildkit scheduler debug logs enabled: #7128

The only workaround was to run a non-nested dev engine, which made the full logs available via docker logs.


I wouldn't personally consider it a bug that we ever drop logs; having some sort of limit somewhere in the telemetry pipeline makes perfect sense. However:

  1. I'm not actually sure that's what's happening here, as opposed to logs going missing due to an actual bug.
  2. Even if that limit is what's being hit, it seems like we hit it a lot earlier than ideal. The repro above writes only 2000 lines, which isn't particularly crazy even if they are written rapidly.
    • The fact that docker logs handles it okay seems particularly telling. Obviously our telemetry has a lot more going on, but the fact that I can't get all the logs even locally feels wrong.
  3. In the PR linked above, I also sometimes saw errors like OpenTelemetry error: grpc received message larger than max (approximately that, based on memory), which does feel like a genuine bug (see the sketch after this list).
    • However, I am not seeing that error anywhere in the repro above, so I don't know what to make of it. Maybe a red herring, or maybe the error itself sometimes gets dropped in these cases?
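
On point 3, for reference: gRPC-Go caps received messages at 4 MiB by default, and exceeding that produces an error shaped like grpc: received message larger than max (N vs. 4194304), which matches what I half-remember. A minimal sketch of where that knob lives on a receiving gRPC server; this is not dagger's actual telemetry wiring, just an illustration of the default:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// 4317 is the conventional OTLP/gRPC port; placeholder here
	lis, err := net.Listen("tcp", ":4317")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer(
		// gRPC-Go defaults to a 4 MiB receive limit; raise it so large
		// batches of spans/log records don't bounce with
		// "received message larger than max"
		grpc.MaxRecvMsgSize(16 * 1024 * 1024),
	)

	// ...register the OTLP collector services on srv here...

	log.Fatal(srv.Serve(lis))
}

The dial side has mirrored knobs via grpc.WithDefaultCallOptions with grpc.MaxCallRecvMsgSize and grpc.MaxCallSendMsgSize, so if a limit like this is what we're hitting, it could sit on either end of the pipeline.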

cc @vito

Could this potentially be related to something like #7227?