grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.

Home Page: https://grafana.com/oss/tempo/

Streaming gRPC does not avoid 'message larger than max' error

akevdmeer opened this issue · comments

When querying Tempo over streaming gRPC (tempo-cli with --use-grpc) for a large query, tempo-cli fails:

tempo-cli: error: main.metricsQueryRangeCmd.Run(): rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4560241 vs. 4194304)

…and the error persists when tempo-cli is given the same max message size that we had previously configured on the server:

tempo-cli: error: main.metricsQueryRangeCmd.Run(): rpc error: code = Internal desc = response larger than the max (16842812 vs 16777216)
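For reference, the client-side receive limit is a grpc-go dial option; a minimal config sketch (the address and the 16 MiB value are illustrative, and as the errors above show, any fixed cap can still be exceeded by a large enough query):

```go
// Sketch: raising the client-side receive limit when dialing a Tempo
// gRPC endpoint. This only moves the cap; it does not remove it.
conn, err := grpc.Dial(
	"localhost:3100",
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
)
```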

To Reproduce
Perform a query with tempo-cli that returns a lot of data. Specifically, I tested with:
tempo-cli query api metrics localhost:3100 '{ selector } | quantile_over_time(duration, .99) by (span.http.url)' starttime endtime --use-grpc

Expected behavior
Streaming gRPC should avoid the max message size limit by chunking appropriately, allowing large queries to run.

Environment:
Tempo 2.5.0 on Kubernetes 1.28

While the other streaming endpoints send only a diff in each message, the metrics streaming endpoint currently sends all accumulated data in every message.

We should update the metrics streaming endpoint to only send the diff by improving this method:

https://github.com/grafana/tempo/blob/main/modules/frontend/combiner/metrics_query_range.go#L63
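The difference between the two strategies can be illustrated with a toy combiner. This is a sketch only: the name seriesDiff and the map-of-floats representation are hypothetical stand-ins, not Tempo's actual combiner, which works on protobuf series.

```go
package main

import "fmt"

// seriesDiff is a toy stand-in for the proposed change: instead of
// re-sending the full accumulated state, compare it against what was
// last sent, return only new or changed entries, and record them as sent.
func seriesDiff(current, lastSent map[string]float64) map[string]float64 {
	out := make(map[string]float64)
	for k, v := range current {
		if prev, ok := lastSent[k]; !ok || prev != v {
			out[k] = v
			lastSent[k] = v // remember what we sent
		}
	}
	return out
}

func main() {
	lastSent := map[string]float64{}

	// First update: everything is new, so everything is sent.
	first := seriesDiff(map[string]float64{"a": 1, "b": 2}, lastSent)
	fmt.Println(len(first)) // 2

	// Second update: only "b" changed, so only "b" is re-sent.
	second := seriesDiff(map[string]float64{"a": 1, "b": 3}, lastSent)
	fmt.Println(len(second)) // 1
}
```

Each message then carries only the delta since the previous GRPCCollector update, rather than growing with the full result set.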

I see, that provokes the error easily!

But isn't it still possible to exceed the connection's max message size even with a proper diff(), when more fresh data arrives than fits between two GRPCCollector updates?

This is correct. We could try to do something that cuts diffs aggressively for larger datasets. It feels like it would be quite difficult to do well and would have some difficult-to-handle edge cases. I'm open to hearing a proposal, but this is currently a lower priority for us.
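One possible shape for the "cut diffs aggressively" idea, as a sketch only (hypothetical names, a deliberately crude size estimate, and not a proposal Tempo has adopted): split an oversized diff into several messages, each kept under a byte budget.

```go
package main

import "fmt"

type series struct {
	Key   string
	Value float64
}

// chunkDiff splits a diff into batches whose estimated encoded size stays
// under maxBytes. The estimate (key length + 8 bytes per float64) is crude;
// a real implementation would measure the marshalled protobuf size. Note
// the edge case the reply above alludes to: a single series larger than
// maxBytes still produces an oversized batch of one.
func chunkDiff(diff []series, maxBytes int) [][]series {
	var batches [][]series
	var cur []series
	curSize := 0
	for _, s := range diff {
		size := len(s.Key) + 8
		if curSize+size > maxBytes && len(cur) > 0 {
			batches = append(batches, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, s)
		curSize += size
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	diff := []series{{"a", 1}, {"bb", 2}, {"ccc", 3}}
	// With a 12-byte budget the estimated sizes are 9, 10, and 11,
	// so each series lands in its own batch.
	batches := chunkDiff(diff, 12)
	fmt.Println(len(batches)) // 3
}
```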

Clear, thank you. A proper diff() for the metrics queries plus a larger gRPC max message size may solve the issue, practically speaking.