Show backpressure in the UI
mwylde opened this issue
When a node in the job graph is not keeping up with its input data, its input queue fills up, and its upstream nodes cannot send it more data until it pulls items from the queue. This backpressures the upstream, which is blocked from doing work until it can write its messages downstream. In this way, we prevent faster upstreams from overloading slower downstreams. See this blogpost for more on the general theory of backpressure in streaming systems (although note that the details are a bit different in Arroyo).
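As a rough illustration of the mechanism described above, here is a minimal single-threaded sketch using Python's standard bounded `queue.Queue`. It is not Arroyo's implementation: the capacity of 2 and the `try_send` helper are hypothetical, and a real upstream would block until space frees up rather than get a refusal.

```python
import queue

# Hypothetical stand-in for a task's bounded input queue (capacity chosen for illustration).
input_queue = queue.Queue(maxsize=2)


def try_send(q, item):
    """Return True if the item was enqueued, False if the queue is full (backpressure)."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


# The upstream tries to send three items into a queue that holds two.
sent = [try_send(input_queue, i) for i in range(3)]

# The downstream pulls one item, which frees a slot for the upstream.
drained = input_queue.get_nowait()
sent_after_drain = try_send(input_queue, 3)
```

The third send fails because the queue is full; once the downstream drains an item, the upstream can make progress again.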
Figuring out whether and where backpressure is occurring is important for users to understand the behavior and performance of their pipelines.
In Arroyo, we have a metric, `arroyo_worker_tx_queue_rem`, that reports how much space remains in a task's transmit queue. When this is 0, the queue is full, which means that the downstream node is causing backpressure on us.
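One way to turn the remaining-space metric into a backpressure value is sketched below. It assumes the total queue size is also available to the caller (the `queue_size` parameter and the `backpressure` function name are hypothetical, not part of Arroyo's API).

```python
def backpressure(queue_rem: float, queue_size: float) -> float:
    """Fraction of the transmit queue in use: 0.0 means empty, 1.0 means full.

    queue_rem is the value of arroyo_worker_tx_queue_rem; queue_size is an
    assumed total capacity that the caller must supply.
    """
    if queue_size <= 0:
        raise ValueError("queue_size must be positive")
    frac = 1.0 - queue_rem / queue_size
    # Clamp in case the two values were sampled at slightly different times.
    return min(1.0, max(0.0, frac))
```

With this definition, a remaining value of 0 maps to a backpressure of 1.0, matching the "queue is full" case described above.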
This data should be visible in the UI. We already have infrastructure to pass metrics back to the UI (which currently powers the data rate graphs), so this would involve extending that API to add the `arroyo_worker_tx_queue_rem` metric. For visualization, the simplest approach would be to color the nodes in the pipeline graph according to how backpressured they are (for example, based on the ratio of remaining queue space to total queue size).
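The coloring idea could look something like the following sketch, which linearly interpolates from green (no backpressure) to red (fully backpressured). The function name and the color scale are assumptions for illustration; the actual UI might prefer discrete buckets or a different palette.

```python
def backpressure_color(bp: float) -> str:
    """Map a backpressure fraction in [0, 1] to a hex color from green to red."""
    bp = min(1.0, max(0.0, bp))  # clamp out-of-range inputs
    red = round(255 * bp)
    green = round(255 * (1.0 - bp))
    return f"#{red:02x}{green:02x}00"
```

A node with an empty transmit queue renders as `#00ff00`, and a node whose queue is full renders as `#ff0000`.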
The nodes in the graph represent logical operators, but in the physical execution each operator is subdivided into `N` parallel subtasks. Similarly, each operator may have `M` downstream nodes if the edge between them is a shuffle. The backpressure for an operator will be some combination of the backpressure of its parallel subtasks (median or min?).
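To compare candidate ways of combining subtasks, here is a small sketch that aggregates per-subtask backpressure fractions (0.0 = empty queue, 1.0 = full) into one operator-level value. The function and its `how` parameter are hypothetical. Note that taking the min of the remaining queue space corresponds to taking the max of the backpressure fraction.

```python
from statistics import median


def operator_backpressure(subtask_bp, how="median"):
    """Combine per-subtask backpressure fractions into a single value.

    how="median" reflects the typical subtask; how="max" reflects the most
    backpressured subtask (equivalent to the min remaining queue space).
    """
    values = list(subtask_bp)
    if not values:
        raise ValueError("need at least one subtask value")
    if how == "median":
        return median(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown aggregation: {how}")
```

For subtasks at `[0.0, 0.25, 1.0]`, the median gives 0.25 while the max gives 1.0, so the max surfaces a single stuck subtask (for example, one caused by key skew) that the median would hide.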
So it will also be helpful to see the per-subtask backpressure, for example in the operator detail view that currently shows the data rate graphs.