ArroyoSystems / arroyo

Distributed stream processing engine in Rust

Home Page: https://arroyo.dev

Show backpressure in the UI

mwylde opened this issue

When a node in the job graph is not keeping up with its input data, its input queue fills up and its upstreams cannot send it more data until it pulls items from the queue. This backpressures each upstream, which is blocked from doing work until it is able to write its messages downstream. In this way, we prevent faster upstreams from overloading slower downstreams. See this blog post for more on the general theory of backpressure in streaming systems (although note that the details are a bit different in Arroyo).
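To make the mechanism concrete, here is a minimal sketch of a bounded queue between an upstream and a downstream, using the standard library's `sync_channel`. It is only an illustration of the general idea; Arroyo's actual queues are not shown in this issue, and the function names `fill_until_blocked` and `send_after_drain` are made up for the example.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to send `n` items into a bounded queue of `capacity` while the
/// downstream never reads. Returns how many items were accepted before
/// the queue reported that it was full.
fn fill_until_blocked(capacity: usize, n: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    for i in 0..n {
        match tx.try_send(i) {
            Ok(()) => accepted += 1,
            // Queue is full: a blocking `send` would stall here.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Fills the queue, checks that the upstream is blocked, lets the
/// downstream pull one item, and checks that the upstream can send again.
/// Assumes `capacity >= 1`.
fn send_after_drain(capacity: usize) -> bool {
    let (tx, rx) = sync_channel::<usize>(capacity);
    for i in 0..capacity {
        tx.try_send(i).unwrap();
    }
    let blocked = tx.try_send(capacity).is_err();
    let _ = rx.recv(); // downstream pulls one item, freeing a slot
    let resumed = tx.try_send(capacity).is_ok();
    blocked && resumed
}
```

With a blocking `send` instead of `try_send`, the upstream thread would simply wait at the point where `try_send` reports `Full`, which is the blocking behavior described above.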

Figuring out whether and where backpressure is occurring is important for users to understand the behavior and performance of their pipelines.

In Arroyo, we have a metric arroyo_worker_tx_queue_rem that reports how much space remains in a task's transmit queue. When it reaches 0, the queue is full, which means that the downstream node is causing backpressure on us.

This data should be visible in the UI. We already have infrastructure to pass metrics back to the UI (which currently powers the data rate graphs), so this would involve extending that API to add the arroyo_worker_tx_queue_rem metric. For visualization, the simplest approach would be to color the nodes in the pipeline graph according to how backpressured they are (for example, based on the ratio of the remaining queue size to the total queue size).
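As a rough sketch of that ratio, the following turns a remaining-queue reading into a backpressure value between 0 and 1 and maps it to a green-to-red color. The `queue_size` parameter is an assumption (the total queue size would have to come from configuration or another metric), and the function names and the color scale are hypothetical, not part of the existing API.

```rust
/// Backpressure as a fraction in [0.0, 1.0]:
/// 0.0 means the transmit queue is empty, 1.0 means it is full.
/// `queue_rem` is the value of arroyo_worker_tx_queue_rem;
/// `queue_size` is the total queue capacity (assumed to be known).
fn backpressure(queue_rem: u64, queue_size: u64) -> f64 {
    if queue_size == 0 {
        return 0.0; // no capacity information, report no backpressure
    }
    // Guard against a reading larger than the assumed capacity.
    let rem = queue_rem.min(queue_size);
    1.0 - rem as f64 / queue_size as f64
}

/// Hypothetical color scale: linear blend from green (no backpressure)
/// to red (fully backpressured), returned as an (r, g, b) triple.
fn node_color(bp: f64) -> (u8, u8, u8) {
    let bp = bp.clamp(0.0, 1.0);
    let red = (bp * 255.0).round() as u8;
    let green = ((1.0 - bp) * 255.0).round() as u8;
    (red, green, 0)
}
```

For example, a reading of 25 remaining slots out of 100 gives a backpressure of 0.75, which this scale would render as a mostly red node.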

The nodes in the graph represent logical operators, but in the physical execution each operator is subdivided into N parallel subtasks. Similarly, each operator may have M downstream nodes if the edge between them is a shuffle. The backpressure for an operator will be some combination of the backpressure of its parallel subtasks (median or min?).
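To compare the candidate combinations, here is a small sketch that reduces per-subtask backpressure fractions (0.0 to 1.0, higher meaning more backpressured) to a single operator-level value. The `Aggregation` enum and `operator_backpressure` function are hypothetical names for illustration; which aggregation to use is still the open question above. Note that taking the min over remaining queue space corresponds to `Max` over backpressure fractions here.

```rust
/// Candidate ways to combine subtask values into one operator value.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Aggregation {
    Min,
    Median,
    Max,
}

/// Combines per-subtask backpressure fractions into one value.
/// Returns `None` if there are no subtasks.
fn operator_backpressure(subtasks: &[f64], agg: Aggregation) -> Option<f64> {
    if subtasks.is_empty() {
        return None;
    }
    let mut sorted = subtasks.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    Some(match agg {
        Aggregation::Min => sorted[0],
        Aggregation::Max => sorted[n - 1],
        Aggregation::Median => {
            if n % 2 == 1 {
                sorted[n / 2]
            } else {
                // Even count: average the two middle values.
                (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
            }
        }
    })
}
```

One thing this makes visible: with a single hot subtask out of four (for example, values of 0.0, 0.0, 1.0, 0.0), both `Min` and `Median` report 0.0 and hide the problem, while `Max` reports 1.0. That may matter when data is skewed toward one subtask.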

So it will also be helpful to see the per-subtask backpressure, for example in the operator detail view that currently shows the data rate graphs.