palantir / witchcraft-go-server

A highly opinionated Go embedded application server for RESTy APIs

Default server/client metrics too verbose, no option to trim down

mahmoudm opened this issue · comments

Some of the default request/server metrics being emitted break out into 12 distinct values (15m, 1m, 5m, Count, Max, Mean, meanRate, min, p50, p95, p99, stddev), which are probably not all equally valuable and are expensive, especially when emitted by daemons deployed on every single host.

I propose we move these metrics to a whitelist model, where we have a default filter of the values we care about for a given metric (e.g. if I only care about p99, I should be able to get only that), and allow products to override that filter.

This gives us 1) the ability to have a sane default that we can iterate on over time, depending on the value-vs-cost trade-off of each metric, and 2) a way for products with different constraints to make different decisions (e.g. daemons that only have health endpoints might not care about any of those values).
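To make the proposal concrete, here is a minimal sketch of a default whitelist with a per-product override. Everything in it is hypothetical: `defaultAllowed`, `resolveAllowed`, and `applyWhitelist` are illustrative names, not part of witchcraft-go-server or palantir/pkg, and the default key sets are placeholders rather than agreed defaults.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultAllowed is a hypothetical default whitelist keyed by metric type.
// The key names follow the values listed in this issue; the defaults chosen
// here are placeholders for illustration only.
var defaultAllowed = map[string][]string{
	"timer": {"count", "max", "p95", "p99"},
	"meter": {"count"},
}

// resolveAllowed returns the set of allowed keys for a metric type. A product
// override for that type, if present, replaces the default entirely.
func resolveAllowed(metricType string, overrides map[string][]string) map[string]struct{} {
	keys, ok := overrides[metricType]
	if !ok {
		keys = defaultAllowed[metricType]
	}
	set := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		set[k] = struct{}{}
	}
	return set
}

// applyWhitelist keeps only the values whose keys are in the allowed set.
func applyWhitelist(values map[string]float64, allowed map[string]struct{}) map[string]float64 {
	out := make(map[string]float64, len(allowed))
	for k, v := range values {
		if _, ok := allowed[k]; ok {
			out[k] = v
		}
	}
	return out
}

// sortedKeys returns the keys of m in sorted order, for stable printing.
func sortedKeys(m map[string]float64) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	timerValues := map[string]float64{"count": 42, "1m": 0.5, "p50": 10, "p99": 120, "max": 150}

	// No override: the default timer whitelist applies.
	fmt.Println(sortedKeys(applyWhitelist(timerValues, resolveAllowed("timer", nil))))

	// A product that only cares about p99 overrides the timer default.
	overrides := map[string][]string{"timer": {"p99"}}
	fmt.Println(sortedKeys(applyWhitelist(timerValues, resolveAllowed("timer", overrides))))
}
```

In this sketch an override with an empty list for a type suppresses every value of that type, which would cover the daemon case mentioned above.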

This is also relevant, in case you want to handle it with a similar approach: #125

Right now, the values are determined based on the metric type. For example, timer metrics emit 12 values as described in the ticket (https://github.com/palantir/pkg/blob/master/metrics/value.go#L123), while meter values emit 5 (https://github.com/palantir/pkg/blob/master/metrics/value.go#L105).

The interesting question to me is how we want to configure what gets output. Broadly, the options are:

  • Configure globally per metric type
    • For example, we could declare that all timer metrics should only emit Count, p99 and stddev
  • Configure on a per-metric basis
    • For example, for server.request.size, we could specify that only certain keys should be recorded

If we take the latter approach, we may want to have it be part of the same config as proposed for #125

Good question. My personal preference is the global one, as I find it unlikely that I'd care about specifying this differently on a per-metric basis.

For global, the basic choices are blacklisting specific keys for specific types, or blacklisting specific keys globally.

For example, "meter" and "timer" metrics both have keys such as "count", "1m", "5m".

We could imagine either a blacklist like:

meter:
  - 1m
timer:
  - count

(The above would omit the "1m" key for all "meter" metrics, while "timer" metrics would still have the "1m" key but would omit "count".)

Or a global blacklist like:

- "p50"
- "1m"

That would blacklist the keys across all metrics.

Between those two options I'd lean towards keying by metric type (since it offers more flexibility), but I'm open to suggestions.
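The difference between the two blacklist shapes can be sketched as follows. This is only an illustration under assumed names (`typeBlacklist`, `globalBlacklist`, and `emit` are hypothetical and do not exist in palantir/pkg/metrics); it reuses the two example configurations above.

```go
package main

import "fmt"

// typeBlacklist is a hypothetical per-type blacklist: keys to omit, by metric type.
type typeBlacklist map[string][]string

// globalBlacklist is a hypothetical global blacklist: keys to omit for every type.
type globalBlacklist []string

// contains reports whether key appears in keys.
func contains(keys []string, key string) bool {
	for _, k := range keys {
		if k == key {
			return true
		}
	}
	return false
}

// emit reports whether key should be emitted for a metric of the given type.
func (b typeBlacklist) emit(metricType, key string) bool {
	return !contains(b[metricType], key)
}

// emit ignores the metric type: a blacklisted key is dropped everywhere.
func (b globalBlacklist) emit(_ string, key string) bool {
	return !contains(b, key)
}

func main() {
	byType := typeBlacklist{"meter": {"1m"}, "timer": {"count"}}
	global := globalBlacklist{"p50", "1m"}

	fmt.Println(byType.emit("meter", "1m")) // false: omitted for meters
	fmt.Println(byType.emit("timer", "1m")) // true: timers keep "1m"
	fmt.Println(global.emit("timer", "1m")) // false: omitted for every type
}
```

The per-type form can express the global one by repeating a key under each type, which is the extra flexibility mentioned above, at the cost of a slightly more verbose config.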

By metric type makes sense to me.

I would also suggest taking a stance on some of these instead of leaving it up to consumers to decide. For example, meanRate should be blacklisted across the board as should the 1m/5m/15m. Count is enough to get the rate over time.

This is already what we do in witchcraft server implementations for other languages (Java in particular).

Yes, in sync with setting sane defaults. Anything else beyond the things you listed there?

We also don't publish p50, stddev, mean, or min for Java, so I think we should remove those as well. I don't think there are legitimate use cases for them either.
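As a quick check of what those suggestions leave behind for timers, the sketch below subtracts the keys proposed for removal in this thread from the 12 timer values listed in the issue description. The key spellings are lowercased here as an assumption, and `remaining` is a hypothetical helper; this is not the code from the PR.

```go
package main

import "fmt"

// timerKeys lists the 12 timer values named in this issue (spelling assumed).
var timerKeys = []string{
	"count", "1m", "5m", "15m", "meanRate", "min",
	"max", "mean", "stddev", "p50", "p95", "p99",
}

// proposedBlacklist collects the keys suggested for removal in this thread.
var proposedBlacklist = []string{
	"meanRate", "1m", "5m", "15m", "p50", "stddev", "mean", "min",
}

// remaining returns the keys in all that are not in blacklist, preserving order.
func remaining(all, blacklist []string) []string {
	drop := make(map[string]struct{}, len(blacklist))
	for _, k := range blacklist {
		drop[k] = struct{}{}
	}
	var out []string
	for _, k := range all {
		if _, ok := drop[k]; !ok {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	fmt.Println(remaining(timerKeys, proposedBlacklist)) // [count max p95 p99]
}
```

Under these assumptions a timer would go from 12 emitted values to 4.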

Updated PR with these defaults. Next step is to iterate with @bmoylan on the feedback he provided (he flagged a good point about validating values and I outlined some possible approaches and am waiting to hear back).