pixie-io / pixie

Instant Kubernetes-Native Application Observability

Home Page: https://px.dev

Capture partial frames for syscalls which exceed bpf `LOOP`/`CHUNK` limits

benkilimnik opened this issue:

Problem Description
Pixie is unable to capture syscalls with an iovcnt > 42 or a message size > 120 KiB.

These variables are set conservatively to keep the instruction count below BPF's limit for 4.x kernels (4096 instructions per probe). These limits, however, result in data loss and incomplete syscall tracing. For example, in a community-shared NodeJS application transferring just 10 kB of data, the iovec array contained 257 entries, well beyond the current LOOP_LIMIT of 42. We've also seen the message size limit (CHUNK_LIMIT) exceeded in k8ssandra deployments. This is very likely an issue across all protocols.
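For intuition, here is a minimal sketch (not Pixie's actual code; the helper name is a stand-in) of the bounded-loop pattern behind these limits: on 4.x kernels the verifier requires a fixed, fully-unrolled loop bound, so any iovec entries past LOOP_LIMIT are simply never read.

```cpp
#include <cstddef>
#include <sys/uio.h>  // struct iovec

#define LOOP_LIMIT 42  // must keep the unrolled instruction count under 4096

// Sketch of a BPF-style bounded loop: everything past LOOP_LIMIT is dropped.
static void SubmitIovecsSketch(const struct iovec* iov, size_t iovcnt) {
#pragma unroll
  for (size_t i = 0; i < LOOP_LIMIT; ++i) {
    if (i >= iovcnt) {
      return;
    }
    // ... read iov[i].iov_base and perf-submit iov[i].iov_len bytes ...
  }
  // Entries iov[LOOP_LIMIT .. iovcnt) were never visited: a gap is created.
}
```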

Proposed Solutions

  1. Dynamically increase the loop limit for newer kernels with higher instruction limits (1 million for kernels > 5.1). This could mitigate the issue, though it would likely persist for very large messages/iovecs. (This approach could be combined with option 2 to increase both the amount of data and the number of frames Pixie can trace.)
  • Each PEM now tracks its kernel version thanks to the changes introduced in #1685. We could pass this version in as a compile-time flag / preprocessor directive, as in the sketch below.
  • Note that even if bpf were able to trace everything for large messages, Pixie would still truncate the data (e.g. based on FLAGS_max_body_bytes for HTTP). However, capturing complete metadata could still be invaluable, conveying headers, response codes, and other important information.
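A rough sketch of the preprocessor-directive idea (the flag name is hypothetical, and the raised values are only illustrative here):

```cpp
// Hypothetical compile-time knob: the PEM detects its kernel version at
// startup (per #1685) and compiles the BPF program with a matching define.
#ifdef KERNEL_HAS_HIGH_INSN_LIMIT  // hypothetical flag for kernels >= 5.1
#define LOOP_LIMIT 882   // 21x the conservative default
#define CHUNK_LIMIT 84
#else
#define LOOP_LIMIT 42    // keeps instruction count under 4096 on 4.x kernels
#define CHUNK_LIMIT 4
#endif
```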

  2. For each event where data loss occurs due to LOOP/CHUNK limits, pass metadata to the event parser, which then attempts to process a partial frame. For this to work, protocol parsers must be modified to work lazily, parsing as far as possible and returning a new ParseState, kPartialFrame, once they have processed enough bytes to capture the essential metadata.

[Figure: parse call stack]

  • After our LOOP/CHUNK limit is reached, the event parser will eventually receive a contiguous head of the data stream buffer that ends with a gap representing the bytes we missed. Note that there could be any number of valid frames before the gap because Pixie's sampling frequency is greater than its push frequency (sampling is used loosely here, as Pixie receives every event and not a subset of them). Moreover, the application itself could be batching messages such that an incomplete chunk could contain a number of valid frames before the gap.

[Figure: contiguous head with incomplete chunk]

  • In BPF, we can determine the full message size and keep track of how many bytes were missed if the LOOP/CHUNK limit is reached. We can pass this information through the event to the data stream buffer, so that the event parser knows when to expect an incomplete chunk. A DCHECK would enforce that, for a given call to ParseFramesLoop with a contiguous head, a partial frame is pushed at most once, since we expect to reach the gap only once.

  • To avoid potential side effects from the kPartialFrame state (i.e. to prevent it from masking other errors), we could use a heuristic to determine whether a partial frame was actually caused by a lack of bytes. We could store, in the metadata of a frame, the maximum size of the fields we could possibly parse. If this is greater than the number of bytes remaining, then we hit our gap, and it makes sense to push a partial frame. If, however, we have sufficient bytes remaining to parse these fields, then a different error likely occurred and we don't want to push the partial frame. A sketch of this heuristic is shown below.
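A minimal sketch of that heuristic; kPartialFrame and the metadata field are proposals from this issue, not existing code:

```cpp
#include <cstddef>

// Proposed parse states; kPartialFrame is the new state described above.
enum class ParseState { kSuccess, kNeedsMoreData, kInvalid, kPartialFrame };

// Hypothetical frame metadata: the largest number of bytes the essential
// fields (headers, response codes, ...) could possibly occupy.
struct FrameMetadata {
  size_t max_parseable_field_bytes = 0;
};

// If the bytes remaining before a known gap are fewer than the essential
// fields could occupy, the parse failure is explained by the gap, so we push
// a partial frame. Otherwise a different error occurred; don't mask it.
ParseState ClassifyParseFailure(const FrameMetadata& meta,
                                size_t bytes_remaining, bool gap_ahead) {
  if (gap_ahead && meta.max_parseable_field_bytes > bytes_remaining) {
    return ParseState::kPartialFrame;
  }
  return ParseState::kInvalid;
}
```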


  3. An alternative option is to use tail calls to start a new BPF program where the previous one left off. This might be an invasive solution with some performance trade-offs (the upper nesting limit is 33 calls).

Great write-up @benkilimnik. Here are my initial thoughts.

Option 1: We should definitely do this. It is the main way to capture more data. We had avoided this in the past because we didn't want to create different behavior on different kernels (which becomes a headache when users report issues, a headache for testing, etc.). However, I think we're at the point where we should cross this bridge. We should stay principled and limit the kernel-level differences to a few key parameters like the LOOP_LIMIT. We should also measure any perf impacts, because it may be that these LIMITs are actually saving us from burning too much CPU in high-bandwidth cases.

Option 2: This is a good idea, but it will be complex. Note that in the case where we drop data, we still track positions correctly in the DataStreamBuffer, because we take the total bytes from the syscall itself (see https://github.com/pixie-io/pixie/blob/b6cf0ac3e4723a409b41cc0b603f3b8ddc1001b3/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c#L819C5-L819C22). So we already create gaps where these losses occur. The bulk of this work, in my mind, becomes letting the parsers know where missing data is located and letting them try to parse regardless. That means we need a new data format (where gaps can be encoded) and an augmentation of all of the parsers (which we could of course do incrementally). So I think Option 2 is a worthy goal, but it is complex.
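To make the "data format where gaps can be encoded" concrete, one possible shape (names and types are hypothetical, not a committed design):

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical gap-aware input for the parsers: a run of captured bytes,
// optionally followed by a gap of known size (the bytes BPF dropped).
struct GapAwareChunk {
  std::string_view data;       // bytes actually captured
  size_t gap_bytes_after = 0;  // 0 means no gap follows this chunk
};

// A parser would walk a sequence of such chunks, parsing normally within each
// one and deciding at every gap whether to emit a partial frame, resync after
// the gap, or give up on the contiguous head.
using GapAwareStream = std::vector<GapAwareChunk>;
```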

Option 3: We've looked into this before. I'm not sure it works that easily, because tail calls can't return, and there is processing we do after the loop completes. A lot of state would have to be passed from one call to the next so that it could finish the work. It won't be pretty.

In short, I would definitely do Option 1. It's easier than option 3 for capturing more data, and the older kernels are starting to become less prevalent, making it a good option. Then I would follow that up with Option 2, but treading very carefully.

Thanks for the comments @oazizi000. I've looked into this further:

Option 1: Raising chunk/loop limits

I have confirmed that the verifier allows us to raise the LOOP_LIMIT and CHUNK_LIMIT by at most 21x (to 882 and 84, respectively) for kernels >= 5.1. I also ran the px-k8ssandra perf test to evaluate the impact of raising the limits (I verified that this is a workload in which the loop limit is actually reached). It is hard to tell from the graph whether there is a performance difference compared to the baseline; more fine-grained tests will likely be needed to accurately determine the CPU impact of such an increase, but this at least rules out a major degradation.

[Figure: 20 min perf test for px-k8ssandra with current loop limit]

Note that the only gap condition triggered in px-k8ssandra is kExceededLoopLimit, so raising the loop limit alone should increase ingest. We currently lose around 18 MiB of HTTP data every 20 min due to reaching the loop limit in this demo. More details about how these metrics were gathered can be found in the metrics section below.

| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
| --- | --- | --- | --- |
| kExceededLoopLimit | false | kProtocolHTTP | ~17.8 MiB lost |
| kFiller | false | kProtocolHTTP | ~17.8 MiB in filler allocations |

[Figure: k8ssandra perf test with current loop and chunk limits]

[Figure: 20 min perf test for px-k8ssandra with 21x higher loop limit]

With a higher loop limit, we lose significantly less HTTP data (~1.2 MiB instead of ~17.8 MiB).

| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
| --- | --- | --- | --- |
| kExceededLoopLimit | false | kProtocolHTTP | ~1.2 MiB |
| kFiller | false | kProtocolHTTP | ~1.2 MiB |

Specific perf differences are hard to determine from the aggregate graphs, but this at least rules out a major degradation.

[Figure: 21x higher loop limit]

Metrics for gap size and bytes discarded before gap

I have added Prometheus metrics to keep track of the number of bytes we miss (the size of the gap) in the socket tracer for several gap conditions. This metric shows us the prevalence of gaps across protocols as well as the ingest impact of raising the chunk/loop limits. Additionally, I am keeping track of the number of bytes rendered unparseable by the presence of a gap (i.e. the bytes we discard because we cannot process the incomplete chunk).

With the Prometheus metrics deployed in production, we should see how frequently these cases arise in the wild and how much more data ingest we could gain by implementing lazy parsing on a per-protocol basis or by raising the loop/chunk limits.
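For reference, a hedged sketch of how such a labeled counter could be wired up with prometheus-cpp (the metric and label names mirror the tables in this comment; the actual wiring in Pixie's metrics code may differ):

```cpp
#include <prometheus/counter.h>
#include <prometheus/registry.h>

prometheus::Registry registry;

// Counter family keyed by the labels shown in the tables in this comment.
auto& bytes_lost_family =
    prometheus::BuildCounter()
        .Name("total_bytes_lost_in_gaps")
        .Help("Bytes missed in the data stream buffer due to gaps.")
        .Register(registry);

auto& http_loop_limit_counter =
    bytes_lost_family.Add({{"incomplete_reason", "kExceededLoopLimit"},
                           {"lazy_parsing_enabled", "false"},
                           {"protocol", "kProtocolHTTP"}});

// On each gap event: http_loop_limit_counter.Increment(gap_size_bytes);
```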

Gap Conditions

I have identified several conditions under which we'd see a gap in the data stream buffer due to the behavior of our bpf code.

process_syscall_sendfile

Commit c61cbbe added support for tracing sendfile calls, which some servers use to write data to a socket. We can't trace the data itself, so we instead fill in the missing data with null bytes, up to 1 MB. If the data is larger than 1 MB, we create a gap in the data stream buffer (the associated incomplete chunk is most likely dropped in the event parser, i.e. ParseFramesLoop returns kInvalid once the parser reaches the gap, dropping the contiguous head).

Note that the metadata (e.g. HTTP headers) preceding the file will have been sent and captured before the sendfile syscall and should therefore have been placed before the file in the data stream buffer.
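A sketch of the padding arithmetic described above (the helper name is a stand-in; the 1 MB constant is from the text):

```cpp
#include <algorithm>
#include <cstddef>

constexpr size_t kMaxFilledSizeBytes = 1024 * 1024;  // 1 MB cap on filler

// sendfile moves bytes kernel-side, so the payload is never visible to us.
// We pad the stream with null bytes up to the cap; anything beyond the cap
// becomes a gap in the data stream buffer.
size_t SendfileFillerBytes(size_t sendfile_bytes, size_t* gap_bytes_out) {
  size_t filled = std::min(sendfile_bytes, kMaxFilledSizeBytes);
  *gap_bytes_out = sendfile_bytes - filled;  // non-zero => gap created
  return filled;
}
```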

[Figure: sendfile]

perf_submit_wrapper

On the last iteration of the chunk loop (which chops large messages into CHUNK_LIMIT-1 parts), we submit all of the remaining bytes of the msg, truncating if it is larger than MAX_MSG_SIZE (32 KiB). ExtractFillerEvent then attempts to fill the truncated data up to kMaxFilledSizeBytes (1 MB). If the message is larger than this, we end up with a gap, as sketched below.
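In sketch form, the truncation-plus-filler arithmetic this describes (the constants are parameters here rather than hard-coded, since only their roles matter):

```cpp
#include <algorithm>
#include <cstddef>

// How many bytes of the final chunk end up as an unfillable gap: the last
// submit is truncated to max_msg_size, the truncated remainder is padded with
// null bytes up to max_filled_size_bytes, and anything beyond that is lost.
size_t LastChunkGapBytes(size_t remaining_bytes, size_t max_msg_size,
                         size_t max_filled_size_bytes) {
  size_t submitted = std::min(remaining_bytes, max_msg_size);
  size_t truncated = remaining_bytes - submitted;
  size_t filled = std::min(truncated, max_filled_size_bytes);
  return truncated - filled;  // non-zero => gap in the data stream buffer
}
```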

[Figure: perf_submit_wrapper gap]

perf_submit_iovecs

There are two cases I've identified in which we end up with an incomplete chunk in perf_submit_iovecs: the first is when we exceed the LOOP_LIMIT (i.e. the msg is chopped into more iovecs than we can process), and the second is when the iov_size exceeds the maximum filler event size (i.e. the msg is divided into individually large iovecs). I have confirmed that both of these conditions occur in practice in the CQL and NodeJS BPF tests (more details in the BPF tests section below). A sketch of the two conditions follows.
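A sketch of the two conditions (the enum values echo the metric labels used in this comment; the names are stand-ins):

```cpp
#include <cstddef>

// The two ways perf_submit_iovecs can leave an incomplete chunk, per the
// description above.
enum class IovecGapReason { kNone, kExceededLoopLimit, kIovExceedsMaxFillerSize };

IovecGapReason ClassifyIovecGap(size_t iovcnt, size_t iov_size,
                                size_t loop_limit,
                                size_t max_filled_size_bytes) {
  if (iovcnt > loop_limit) {
    // Message split across more iovecs than the bounded loop can visit.
    return IovecGapReason::kExceededLoopLimit;
  }
  if (iov_size > max_filled_size_bytes) {
    // A single iovec too large to pad with filler after truncation.
    return IovecGapReason::kIovExceedsMaxFillerSize;
  }
  return IovecGapReason::kNone;
}
```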

[Figure: perf_submit_iovecs gaps]

Note that we don't know anything about the number or size of frames contained within these bytes - we could have one very big frame stretched across multiple iovecs, or many smaller frames in a single iovec. Either way, having gaps (with or without filler bytes) will most likely cause parsing to fail.

Frequency of these gap conditions

I have verified that the following gap conditions are triggered in our demo apps / bpf tests:

BPF tests:

  • cql_trace_bpf_test: iovExceedsMaxMsgSize
  • nodejs_trace_bpf_test (new BPF test adapted from a community-submitted app, see #1794): iovExceedsLooplimit where iov_len = 257

Demos

px-k8ssandra (run for 1 hour):

Turning off filler and parsing lazily (as far as possible, up to the gap) could save us ~56 MB of null-byte allocations per hour.

| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
| --- | --- | --- | --- |
| kExceededLoopLimit | false | kProtocolHTTP | ~53.3 MiB lost |
| kFiller | false | kProtocolHTTP | ~53.3 MiB in filler allocations |

Note that in this particular case the traffic doesn't seem to reach our parser (it gets purged from the data stream buffer due to the connection resolution errors printed below), which is why the unparseable_bytes_before_gap metric is 0.

W20231120 18:51:18.174887 1391373 conn_tracker.cc:421] Resetting DataStream buffers. Disabling connection dest=10.98.4.126:9966 reason=No client-side tracing: Remote endpoint is inside the cluster.
W20231120 18:51:18.175134 1391373 conn_tracker.cc:421] Resetting DataStream buffers. Disabling connection dest=-:-1 reason=Could not determine role for traffic because connection resolution failed.
| incomplete_reason | lazy_parsing_enabled | protocol | unparseable_bytes_before_gap |
| --- | --- | --- | --- |
| kFullyFormed | false | kProtocolHTTP | 0.0 |

px-kafka (run for 30 min):

The kafka demo features a significant number of sendfile events, which we currently pad with filler. Turning off filler events and parsing lazily could save us ~175 MB every 30 min.

| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
| --- | --- | --- | --- |
| kHeaderEvent | false | kProtocolKafka | 0.0 |
| kFiller | false | kProtocolKafka | ~166.4 MiB in filler allocations |
| kSendFile | false | kProtocolKafka | ~166.4 MiB lost |

Note that sendfile events contain no data (the metadata is transferred beforehand), so our unparseable_bytes_before_gap is zero as well. With lazy parsing, we could do away with filler events and just process the metadata.

Option 2: Lazy parsing

Instead of relying on filler events with large null-byte allocations, we could modify the protocol parsers to run lazily, parsing as far as possible. Lazy parsing would let us capture partial frames when we know from BPF that a gap is coming; to avoid masking unrelated errors, we should push a partial frame only in that case. In the sendfile case, we only know at the start of the file (after the metadata is captured) that a gap is coming up, so a lazy parser would need to return a ParseState akin to kMetadataComplete, where the metadata is the minimum amount of information we need to successfully stitch a frame.
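As a toy illustration of the idea (this is not the code in #1789; the state names follow the proposal above and the header scan is deliberately simplistic):

```cpp
#include <cstddef>
#include <string_view>

enum class ParseState { kInvalid, kNeedsMoreData, kMetadataComplete, kSuccess };

struct HTTPFrameSketch {
  std::string_view headers;  // essential metadata: request line + headers
  std::string_view body;     // possibly truncated by an upcoming gap
};

// Parse as far as possible. If the headers are complete but the body sits
// behind a known gap, report kMetadataComplete so the caller can push a
// partial frame instead of discarding the whole contiguous head.
ParseState ParseLazily(std::string_view buf, bool gap_ahead,
                       HTTPFrameSketch* frame) {
  size_t hdr_end = buf.find("\r\n\r\n");
  if (hdr_end == std::string_view::npos) {
    return gap_ahead ? ParseState::kInvalid : ParseState::kNeedsMoreData;
  }
  frame->headers = buf.substr(0, hdr_end + 4);
  frame->body = buf.substr(hdr_end + 4);
  // A real parser would check Content-Length against the gap here; elided.
  return gap_ahead ? ParseState::kMetadataComplete : ParseState::kSuccess;
}
```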

I have put together a possible lazy implementation of the HTTP parser in #1789. Any feedback is much appreciated.

Originally, I considered only pushing a partial frame if the protocol parser got past the incomplete event's start. This turned out to be difficult to implement because events can be very small. For instance, a community-submitted sample NodeJS application has an iovec with 257 elements, each of which contains only ~16 bytes. With a loop limit of 42, we merge 41 events into the data stream buffer, getting us to 878 bytes. Then on the 42nd iovec we reach the loop limit, so our incomplete event begins at the 878th byte. When the protocol parser receives the head with this incomplete event, it is able to parse up to the 157th byte (kMetadataComplete). If we only pushed a partial frame after the incomplete event's start (i.e. after position 878), we would miss the partial frame at the start. So instead, the lazy parser I built treats a kMetadataComplete in a contiguous head with an incomplete chunk as a valid partial frame. It also adds a DCHECK to ensure that a contiguous head ending in a gap contains at most one partial frame (sketched below).
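A compressed sketch of that invariant (the loop shape is illustrative, not the actual ParseFramesLoop):

```cpp
#include <glog/logging.h>  // DCHECK_LE
#include <string_view>
#include <utility>
#include <vector>

enum class ParseState { kInvalid, kNeedsMoreData, kMetadataComplete, kSuccess };

// Illustrative loop: a contiguous head that ends in a gap can yield at most
// one partial frame, because the gap is reached at most once per call.
template <typename TFrame, typename TParser>
void ParseFramesLoopSketch(TParser* parser, std::string_view head,
                           bool ends_in_gap, std::vector<TFrame>* frames) {
  int partial_frames = 0;
  while (!head.empty()) {
    TFrame frame;
    ParseState state = parser->ParseFrame(&head, &frame, ends_in_gap);
    if (state == ParseState::kInvalid || state == ParseState::kNeedsMoreData) {
      break;
    }
    frames->push_back(std::move(frame));
    if (state == ParseState::kMetadataComplete) {
      ++partial_frames;
      break;  // the gap has been reached; nothing further to parse
    }
  }
  DCHECK_LE(partial_frames, 1);
}
```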

Related PRs

#1786, #1787, #1788, #1789, #1790, #1791, #1792, #1793, #1794, #1795