Default server/client metrics too verbose, no option to trim down

Question

mahmoudm opened this issue 5 years ago

Mahmoud Abdelsalam commented 5 years ago

Some of the default request/server metrics break out into 12 distinct values (15m, 1m, 5m, Count, Max, Mean, meanRate, min, p50, p95, p99, stddev). These are probably not all equally valuable, and emitting all of them is expensive, especially for daemons deployed on every single host.

I propose we move these metrics to a whitelist model: a default filter selects the values we care about for a given metric (e.g., if I only care about p99, I should be able to get only that), and products can override that filter.

This gives us 1) a sane default that we can iterate on over time, depending on each metric's value-versus-cost trade-off, and 2) room for products with different constraints to make different decisions (e.g., daemons with only health endpoints might not care about any of those).

Mahmoud Abdelsalam · Answer 1 · Sat Nov 02 2019 07:09:50 GMT+0800 (China Standard Time)

This is also relevant, in case you want to handle it with a similar approach: #125

Nick Miyake · Answer 2 · Tue Nov 05 2019 05:18:30 GMT+0800 (China Standard Time)

Right now, the values are determined based on the metric type. For example, timer metrics emit 12 values as described in the ticket (https://github.com/palantir/pkg/blob/master/metrics/value.go#L123), while meter values emit 5 (https://github.com/palantir/pkg/blob/master/metrics/value.go#L105).

The interesting thing to me here is how we want to go about configuring what is output. Broadly, the options are:

- Configure globally per metric type
  - For example, we could declare that all timer metrics should only emit Count, p99 and stddev
- Configure on a per-metric basis
  - For example, for server.request.size, we could specify that only certain keys should be recorded

If we take the latter approach, we may want to have it be part of the same config as proposed for #125

Mahmoud Abdelsalam · Answer 3 · Tue Nov 05 2019 06:18:21 GMT+0800 (China Standard Time)

Good question -- my personal preference is the global one, as I find it unlikely that I'd care about specifying this differently on a per-metric basis.

Nick Miyake · Answer 4 · Tue Nov 05 2019 06:43:18 GMT+0800 (China Standard Time)

For global, the basic choices are blacklisting specific keys for specific types, or blacklisting specific keys globally.

For example, "meter" and "timer" metrics both have keys such as "count", "1m", "5m".

We could imagine either a blacklist like:

meter:
  - 1m
timer:
  - count

(the above would omit the "1m" key for all "meter" metrics and the "count" key for all "timer" metrics, but the "timer" metrics would still have the "1m" key)

Or a global blacklist like:

- "p50"
- "1m"

That would blacklist the keys across all metrics.

Between those 2 options I'd lean towards keying by metric type (since it offers more flexibility), but I'm open to suggestions.
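
To make the type-keyed option concrete, here is a minimal Go sketch of how such a blacklist might be applied before values are emitted. The function name `filterValues`, the blacklist shape (metric type mapped to a list of keys), and the sample values are all hypothetical; this is not the API of palantir/pkg/metrics.

```go
package main

import (
	"fmt"
	"sort"
)

// filterValues returns a copy of values without the keys blacklisted for
// metricType. The blacklist shape (metric type -> keys) is an assumption
// made for illustration only.
func filterValues(metricType string, values map[string]interface{}, blacklist map[string][]string) map[string]interface{} {
	blocked := make(map[string]struct{}, len(blacklist[metricType]))
	for _, k := range blacklist[metricType] {
		blocked[k] = struct{}{}
	}
	out := make(map[string]interface{}, len(values))
	for k, v := range values {
		if _, skip := blocked[k]; !skip {
			out[k] = v
		}
	}
	return out
}

// sortedKeys returns the keys of m in a stable order for printing.
func sortedKeys(m map[string]interface{}) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Mirrors the type-keyed example above: drop "1m" for meters, "count" for timers.
	blacklist := map[string][]string{
		"meter": {"1m"},
		"timer": {"count"},
	}
	meter := map[string]interface{}{"count": 10, "1m": 0.5, "5m": 0.4}
	timer := map[string]interface{}{"count": 10, "1m": 0.5, "p99": 120}

	fmt.Println(sortedKeys(filterValues("meter", meter, blacklist))) // [5m count]
	fmt.Println(sortedKeys(filterValues("timer", timer, blacklist))) // [1m p99]
}
```

A global blacklist would be the special case where the same key list is applied regardless of `metricType`, which is why keying by type is the more flexible of the two.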

Mahmoud Abdelsalam · Answer 5 · Tue Nov 05 2019 08:27:59 GMT+0800 (China Standard Time)

Keying by metric type makes sense to me.

Ashray Jain · Answer 6 · Tue Nov 05 2019 17:44:29 GMT+0800 (China Standard Time)

I would also suggest taking a stance on some of these instead of leaving it up to consumers to decide. For example, meanRate should be blacklisted across the board, as should the 1m/5m/15m rates. Count is enough to derive the rate over time.

Ashray Jain · Answer 7 · Tue Nov 05 2019 17:45:16 GMT+0800 (China Standard Time)

This is already what we do in witchcraft server implementations for other languages (Java in particular).

Nick Miyake · Answer 8 · Wed Nov 06 2019 00:59:43 GMT+0800 (China Standard Time)

Yes, agreed on setting sane defaults. Anything else beyond the things you listed there?

Ashray Jain · Answer 9 · Wed Nov 06 2019 01:16:16 GMT+0800 (China Standard Time)

We also don't publish p50, stddev, mean, or min for Java, so I think we should remove those as well. I don't think there are legitimate use cases for them either.
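
For reference, a small Go sketch of what would be left on a timer if every key suggested for removal in this thread were dropped. The key spellings and the `remaining` helper are assumptions for illustration; the actual keys and defaults are whatever the library and the PR define.

```go
package main

import "fmt"

// timerKeys lists the 12 timer values named in the issue description.
// The exact spellings are assumed, not taken from the library.
var timerKeys = []string{
	"count", "1m", "5m", "15m", "meanRate", "min",
	"max", "mean", "stddev", "p50", "p95", "p99",
}

// dropped collects the keys suggested for removal in this thread.
var dropped = []string{"meanRate", "1m", "5m", "15m", "p50", "stddev", "mean", "min"}

// remaining returns the entries of all that are not in drop, preserving order.
func remaining(all, drop []string) []string {
	skip := make(map[string]bool, len(drop))
	for _, k := range drop {
		skip[k] = true
	}
	var kept []string
	for _, k := range all {
		if !skip[k] {
			kept = append(kept, k)
		}
	}
	return kept
}

func main() {
	fmt.Println(remaining(timerKeys, dropped)) // [count max p95 p99]
}
```

Under these assumptions a timer would go from 12 emitted values down to 4.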

Nick Miyake · Answer 10 · Wed Nov 06 2019 02:02:05 GMT+0800 (China Standard Time)

Updated PR with these defaults. Next step is to iterate with @bmoylan on the feedback he provided (he flagged a good point about validating values and I outlined some possible approaches and am waiting to hear back).