grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.

Home Page: https://grafana.com/oss/tempo/

Streaming gRPC does not avoid 'message larger than max' error

akevdmeer opened this issue · comments

When querying Tempo over streaming gRPC (tempo-cli with --use-grpc) for a large query, tempo-cli fails:

tempo-cli: error: main.metricsQueryRangeCmd.Run(): rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4560241 vs. 4194304)

…and the error persists when tempo-cli is given the same max message size that we had previously configured on the server:

tempo-cli: error: main.metricsQueryRangeCmd.Run(): rpc error: code = Internal desc = response larger than the max (16842812 vs 16777216)
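For reference, the client-side receive limit is a grpc-go dial option; a minimal config sketch (the address and the 16 MiB value are illustrative, and as the errors above show, any fixed cap can still be exceeded by a large enough query):

```go
// Sketch: raising the client-side receive limit when dialing a Tempo
// gRPC endpoint. This only moves the cap; it does not remove it.
conn, err := grpc.Dial(
	"localhost:3100",
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
)
```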

To Reproduce
Perform a query with tempo-cli that returns a lot of data. Specifically, I tested with:
tempo-cli query api metrics localhost:3100 '{ selector } | quantile_over_time(duration, .99) by (span.http.url)' starttime endtime --use-grpc

Expected behavior
Streaming gRPC should avoid the max message size limit by chunking appropriately, allowing large queries to run.

Environment:
Tempo 2.5.0 on Kubernetes 1.28

While the other streaming endpoints send only a diff in each message, the metrics streaming endpoint currently sends all accumulated data in every message.

We should update the metrics streaming endpoint to only send the diff by improving this method:

https://github.com/grafana/tempo/blob/main/modules/frontend/combiner/metrics_query_range.go#L63
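The difference between the two strategies can be illustrated with a toy combiner. This is a sketch only: the name seriesDiff and the map-of-floats representation are hypothetical stand-ins, not Tempo's actual combiner, which works on protobuf series.

```go
package main

import "fmt"

// seriesDiff is a toy stand-in for the proposed change: instead of
// re-sending the full accumulated state, compare it against what was
// last sent, return only new or changed entries, and record them as sent.
func seriesDiff(current, lastSent map[string]float64) map[string]float64 {
	out := make(map[string]float64)
	for k, v := range current {
		if prev, ok := lastSent[k]; !ok || prev != v {
			out[k] = v
			lastSent[k] = v // remember what we sent
		}
	}
	return out
}

func main() {
	lastSent := map[string]float64{}

	// First update: everything is new, so everything is sent.
	first := seriesDiff(map[string]float64{"a": 1, "b": 2}, lastSent)
	fmt.Println(len(first)) // 2

	// Second update: only "b" changed, so only "b" is re-sent.
	second := seriesDiff(map[string]float64{"a": 1, "b": 3}, lastSent)
	fmt.Println(len(second)) // 1
}
```

Each message then carries only the delta since the previous GRPCCollector update, rather than growing with the full result set.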

I see, that provokes the error easily!

But isn't it still possible to exceed the connection's max message size even with a proper diff(), when more fresh data arrives than fits between two GRPCCollector updates?

This is correct. We could try to do something that cuts diffs aggressively for larger datasets. It feels like it would be quite difficult to do well and would have some difficult-to-handle edge cases. I'm open to hearing a proposal, but this is currently a lower priority for us.
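One possible shape for the "cut diffs aggressively" idea, as a sketch only (hypothetical names, a deliberately crude size estimate, and not a proposal Tempo has adopted): split an oversized diff into several messages, each kept under a byte budget.

```go
package main

import "fmt"

type series struct {
	Key   string
	Value float64
}

// chunkDiff splits a diff into batches whose estimated encoded size stays
// under maxBytes. The estimate (key length + 8 bytes per float64) is crude;
// a real implementation would measure the marshalled protobuf size. Note
// the edge case the reply above alludes to: a single series larger than
// maxBytes still produces an oversized batch of one.
func chunkDiff(diff []series, maxBytes int) [][]series {
	var batches [][]series
	var cur []series
	curSize := 0
	for _, s := range diff {
		size := len(s.Key) + 8
		if curSize+size > maxBytes && len(cur) > 0 {
			batches = append(batches, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, s)
		curSize += size
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	diff := []series{{"a", 1}, {"bb", 2}, {"ccc", 3}}
	// With a 12-byte budget the estimated sizes are 9, 10, and 11,
	// so each series lands in its own batch.
	batches := chunkDiff(diff, 12)
	fmt.Println(len(batches)) // 3
}
```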

Clear, thank you. A proper diff() for the metrics queries plus a larger gRPC max message size may solve the issue, practically speaking.