m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform

Home page: https://m3db.io/


histogram_quantile does not work correctly for higher quantiles (missing sorting of the buckets)?

jodzga opened this issue

The histogram_quantile function does not work correctly for higher quantiles.

The image below shows the edge case for quantile 1.0, but the same is true for high quantiles close to 1.0, e.g. 0.9999.

[Screenshot, 2021-08-30: bucket counts and the computed quantile]

Above, you can see that the buckets "0.256" and above all have a count of 4284. The 1-quantile should therefore be 0.256, but the top graph shows it being calculated as 8.192.

The source code for the function that computes the quantile in Prometheus:
https://github.com/prometheus/prometheus/blob/main/promql/quantile.go#L73

The same function in m3:

```go
func bucketQuantile(q float64, buckets []bucketValue) float64 {
```

sort.Search is supposed to return the smallest bucket that satisfies the condition, but it seems that the m3 version does not sort the buckets first. Perhaps the m3 version assumes that the buckets are sorted somewhere outside the function?
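For reference, here is a simplified, self-contained sketch of this style of bucket-quantile lookup (paraphrased from the linked Prometheus code, not the exact m3 implementation; handling of q outside [0, 1] and of malformed histograms is omitted). Note that sort.Search is only meaningful if the buckets are already sorted by upperBound:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket holds one histogram bucket: its upper bound (le) and the
// cumulative count of observations at or below that bound.
type bucket struct {
	upperBound float64
	count      float64
}

// bucketQuantile is a simplified sketch of the Prometheus-style logic.
// sort.Search returns the smallest index whose predicate is true, which
// only works if buckets are sorted by upperBound.
func bucketQuantile(q float64, buckets []bucket) float64 {
	observations := buckets[len(buckets)-1].count
	rank := q * observations

	// Smallest bucket whose cumulative count reaches the target rank.
	b := sort.Search(len(buckets)-1, func(i int) bool {
		return buckets[i].count >= rank
	})
	if b == len(buckets)-1 {
		// The rank falls in the +Inf bucket; report the next-lower bound.
		return buckets[len(buckets)-2].upperBound
	}

	bucketStart, count := 0.0, buckets[b].count
	if b > 0 {
		bucketStart = buckets[b-1].upperBound
		count -= buckets[b-1].count
		rank -= buckets[b-1].count
	}
	// Linear interpolation within the selected bucket.
	return bucketStart + (buckets[b].upperBound-bucketStart)*(rank/count)
}

func main() {
	// Counts mirroring the issue's example: everything at or above the
	// 0.256 bucket has the same cumulative count.
	buckets := []bucket{
		{0.128, 4000},
		{0.256, 4284},
		{8.192, 4284},
		{math.Inf(1), 4284},
	}
	fmt.Println(bucketQuantile(1.0, buckets)) // 0.256 when the counts are exact
}
```

With exact counts the search lands on the 0.256 bucket as expected; the rest of this thread works out why the real query lands on 8.192 instead.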

Is this the issue that 0d01d62 was intended to fix?

@jodzga - per @markj-db, have you looked at 0d01d62 to see if it fixes this issue? Also, if you have a fix in mind, we would be happy to review a PR.

Thanks @markj-db and @gibbscullen for following up.

Mark, you are right: the lack of sorting is not the problem. The m3 implementation does sort the buckets by upperBound in the sanitizeBuckets function:

```go
func sanitizeBuckets(bucketMap bucketedSeries) validSeriesBuckets {
```

I figured out what the problem is.

I was using the following query: `histogram_quantile(1.0, sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by (le))`.

The problem is that the rate function, while doing its magic, turns counts into fractions. Most fractions can't be represented exactly as floating-point numbers (IEEE 754), so the stored value is an approximation that uses all the bits of the mantissa. Adding several such numbers gives a result that depends on the order of operations, because floating-point addition is not associative; see the example here: https://play.golang.org/p/5jGVFNhGZo8.
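A minimal illustration of the non-associativity (my own values, not the exact playground snippet):

```go
package main

import "fmt"

func main() {
	a, b, c := 0.1, 0.2, 0.3 // none of these is exactly representable in float64

	left := (a + b) + c  // 0.6000000000000001
	right := a + (b + c) // 0.6

	fmt.Println(left == right) // false: float64 addition is not associative
}
```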

As a consequence, the sums of the counts differ very slightly depending on the order in which the numbers were added. Here is an example graph that shows the problem:
[Screenshot, 2021-09-15: graph of sum(rate(...)) - sum(rate(...)) over the same series, which is not identically zero]

From my initial example:
[Screenshot, 2021-09-15: bucket counts from the initial example]

Buckets 0.256 and above seem to have the same count.

Here is the same data point but showing numbers with full precision:
[Screenshot, 2021-09-15: the same data point displayed at full precision]

A quick-and-dirty workaround is to get rid of the "noise" by rounding off some of the mantissa to make the sum order-independent again. Unfortunately, AFAIK Prometheus does not have a function for rounding a number to a given precision, so as a workaround I can multiply the numbers by a very large constant, round to integers, and then divide by the same constant.
For example, instead of `histogram_quantile(1.0, sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by (le))` I can use `histogram_quantile(1.0, sum(round(1000000000*rate(coredns_dns_request_duration_seconds_bucket{}[5m]))) by (le) / 1000000000)`.
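A sketch of why this restores order-independence (hypothetical values, not real rates): after scaling and rounding, the values are integers that float64 represents exactly (up to 2^53), so their sum is exact in any order, and the division by the scale happens once, after the sum.

```go
package main

import (
	"fmt"
	"math"
)

const scale = 1e9 // the "very large constant" from the workaround query

func main() {
	a, b, c := 0.1, 0.2, 0.3 // hypothetical per-series rates

	// Raw float64 sums depend on the order of addition.
	fmt.Println((a+b)+c == a+(b+c)) // false

	// round(scale*x) produces integers that are exactly representable,
	// so summing them is exact and therefore order-independent; the
	// division by scale happens once, after the sum.
	ra, rb, rc := math.Round(a*scale), math.Round(b*scale), math.Round(c*scale)
	fmt.Println(((ra+rb)+rc)/scale == (ra+(rb+rc))/scale) // true
}
```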

I can think of a couple of ways this issue could be "mitigated" in Prometheus/m3. These are just brainstorming ideas, not fully thought through:

  1. Do not depend on the associativity of floating-point addition: make the order of operations deterministic, e.g. sort all numbers before summing them (see the sketch below).
  2. Restore order-independence by making rate and other functions return numbers that do not use the full mantissa (i.e. decrease precision).
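A minimal sketch of idea 1, using a hypothetical sumDeterministic helper (this is not how m3's sum is implemented):

```go
package main

import (
	"fmt"
	"sort"
)

// sumDeterministic sorts a copy of the values before adding them, so the
// result no longer depends on the (possibly nondeterministic) order in
// which the per-series values arrive.
func sumDeterministic(values []float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	var sum float64
	for _, v := range sorted {
		sum += v
	}
	return sum
}

func main() {
	// The same multiset of values always produces the same sum.
	fmt.Println(sumDeterministic([]float64{0.3, 0.1, 0.2}))
	fmt.Println(sumDeterministic([]float64{0.1, 0.2, 0.3}))
}
```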

@gibbscullen I believe this is a pretty serious problem. AFAICT, histogram_quantile queries for high percentiles will return wrong results. I'd love to help out, but I am not a Go developer, and onboarding (getting to the point where I can go through the dev cycle: build, test, modify) could take me a significant amount of time.
Would someone from the m3 maintainers team be interested in collaborating on fixing this issue?

@jodzga the example you gave of `sum(rate(foo[5m])) - sum(rate(foo[5m]))` is interesting because it implies the order of the values being summed is non-deterministic. Perhaps it would be useful to look at how sum() is implemented. My naive read of the M3 code suggests that query execution is parallelized:

https://github.com/m3db/m3/blob/master/src/query/executor/state.go#L192

This could well be a source of non-determinism if not handled carefully; I'm not sure how the query is distributed or how any reduction operations are done.
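A toy illustration of that failure mode (not m3 code): if parallel workers deliver partial results on a channel and the reducer adds them in arrival order, the sum can vary from run to run because float64 addition is not associative.

```go
package main

import "fmt"

func main() {
	values := []float64{0.1, 0.2, 0.3, 0.4}

	results := make(chan float64, len(values))
	for _, v := range values {
		go func(v float64) { results <- v }(v) // completion order is unspecified
	}

	var sum float64
	for range values {
		sum += <-results // the order of addition varies across runs
	}
	fmt.Printf("%.17g\n", sum) // e.g. 1 on one run, 0.99999999999999989 on another
}
```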

@jodzga @markj-db sounds like you may already be working on this together, but one idea would be to change the default here:

```go
// DefaultEngine is the default query engine.
DefaultEngine string `yaml:"defaultEngine"`
```

to the Prometheus engine (`prometheus`).
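For what it's worth, opting in via configuration would presumably look something like this (a hypothetical fragment based on the `yaml:"defaultEngine"` tag above; the exact nesting in the coordinator config may differ):

```yaml
# Hypothetical config fragment; exact placement may differ.
defaultEngine: prometheus
```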