Histogram instead of summary with quantile for latency metrics
AlexZzz opened this issue · comments
- Version: any
- Platform: server and agent
- Subsystem: any
There're a lot of metrics with type summary
in spire. These metric type calculates requests count, sum of all latencies and latency distribuition as quantiles. There are two problems with quantiles:
- It's hard to compute. Comuptation is running in the software (in spire in this case), not in the database, so it costs some CPU on every request
- It's impossible to aggregate. Any aggregation on quantile will make no sense
Aggregation is the biggest problem in my case. It's impossible to create useful dashboard with a large number of metrics represented as quantiles if there're dozens, hundreds or thousands of spire-agents.
Much better option is to use histogram. One can calculate quantiles from histograms on the database. It won't be as accurate as quantiles from service, but accurate enough for most uses.
It would be nice to have latency histograms in spire:)
SPIRE supports multiple telemetry backends: https://github.com/spiffe/spire/blob/v1.10.0/doc/telemetry_config.md
I think there may have been some work to massage telemetry data into a histogram like structure inside the M3 sink code ... perhaps it is possible to do the same thing for the Prometheus sink? I don't think statsd dogstatsd etc supports it though?
Unfortunately I work only with prometheus, not statsd/m3db/etc. Can't help with them 😞
Here there's something about "bins" in statsd. Probably it's the same as prometheus histograms. I couldn't find something similar for dogstatsd.
Do you think it's possible to make histograms code global? Not backend-dependent as for now done for m3?
I think we're open to emitting some of these metrics as histograms. We'll need someone to figure out the best way to support this generically with our telemetry package (go-metrics) and supported backends.
I dug into this a little bit. The Prometheus sink uses hashicorp/go-metrics
, which uses Summaries by default
for AddSample
and AddSampleWithLabels
. If we want to support histograms, we ultimately need to call Histogram.Observe
instead of Summary.Observe
.
I'm not familiar with all the backends, but the dogstatsd
sink seems to be limited in a similar way, in that Datadog would support histograms if the metrics were submitted that way in the first place.
By way of comparison, the m3 sink depends on uber-go/tally
instead, and adds additional methods to produce a histogramfrom within AddSample
and AddSampleWithLabels
.
So as it stands, I can think of a few options:
- Add histogram data to SPIRE by modifying one backend at a time, and dropping the dependency on
go-metrics
for that backend (just like the m3 sink). We'd want to add a configuration option to avoid breaking changes. - Modify
go-metrics
to add histogram support. - Replacing the use of go-metrics entirely and going with something new (e.g. OpenTelemetry), more broadly supported, and more flexible. This seems like a breaking change.
Any other ideas?