grpc-ecosystem / go-grpc-prometheus

Prometheus monitoring for your gRPC Go servers.

Improve metric initialization to 0

mwitkow opened this issue

@brian-brazil in #2 mentioned:

As I said above: sadly this is not possible from a technical point of view.
They could be initialised on every RPC. If we can't do that then we need to create a success and failure metric without labels, which will avoid the main pitfalls (with the downside of loss of granularity).

gRPC interceptors only see an RPC when it lands; they have no access to all registrations.
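
For illustration only, that unlabeled fallback could look like a pair of plain counters that exist (at 0) from process start, so queries over them never come back empty. A minimal sketch; the metric names and the recordOutcome helper are hypothetical, not part of this library:

```go
package grpcmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/codes"
)

// Two counters with no labels: both series are exported from process
// start, so expressions over them never return "no data".
var (
	rpcSuccessTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "grpc_server_rpcs_success_total",
		Help: "RPCs that completed with code OK.",
	})
	rpcFailureTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "grpc_server_rpcs_failure_total",
		Help: "RPCs that completed with a non-OK code.",
	})
)

func init() {
	prometheus.MustRegister(rpcSuccessTotal, rpcFailureTotal)
}

// recordOutcome would be called from the interceptor after each RPC.
func recordOutcome(code codes.Code) {
	if code == codes.OK {
		rpcSuccessTotal.Inc()
	} else {
		rpcFailureTotal.Inc()
	}
}
```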

@brian-brazil, can you please clarify what you mean there? A single non-labeled metric? How will that help? What kind of pitfalls?

We can't do anything about not knowing the different services, but once we intercept an RPC we know that it exists and can initialise all labels then.

What are the possible problems?

The issue with that approach is performance, as you have to initialise all the labels each time. A small cache would probably handle it.
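
A minimal sketch of that idea: initialise every grpc_code series the first time a service/method pair is seen, with a small cache so the work is not repeated on every RPC. The names, the sync.Map choice and the use of codes.Code.String() are illustrative assumptions, not this library's implementation:

```go
package grpcmetrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/codes"
)

var serverHandledTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "grpc_server_handled_total",
		Help: "Total RPCs completed on the server, by service, method and code.",
	},
	[]string{"grpc_service", "grpc_method", "grpc_code"},
)

// seen caches service/method pairs whose code series have already been
// created, so the loop below runs once per pair rather than once per RPC.
var seen sync.Map

// preInitCodes touches every grpc_code series for this service/method the
// first time the interceptor sees the pair, so all of them exist at 0.
func preInitCodes(service, method string) {
	if _, loaded := seen.LoadOrStore(service+"/"+method, struct{}{}); loaded {
		return
	}
	for c := codes.OK; c <= codes.Unauthenticated; c++ {
		serverHandledTotal.WithLabelValues(service, method, c.String()).Add(0)
	}
}
```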

I meant: what are the issues that uninitialized labels can cause? Are they related to division? Can you provide examples?

In the case of the examples, started_total will always "exist", because it's hit the first time the interceptor sees the grpc_service+grpc_method combo.

The issue arises any time you write an expression that presumes the labels always exist, but they don't.

Your "unary request error percentage" example will return nothing if there has never been a failure.

Um. Ok, that's not that much of an issue. Most graphing solutions can assume 0.

It can be quite a major issue, preventing alerts from firing depending on exactly how the expressions are constructed. It's also really annoying to work with.

If the expression were written the other way around, for example, a 100% error ratio would not trigger an alert.

Well, there's not much that can be done here: the gRPC map of all registered services is private and not accessible from the interceptor. Moreover, the RegisterService method is public only so that generated code can use it, and the ServiceDesc in the generated code is, handily, private as well.

Since most of the useful functions depend on started_total, which is guaranteed to have all the labels that any other metric will have, I see no reason to complicate the code with some manual pre-population.

I'm not talking about the services; I'm talking about the code label on grpc_server_handled_total.

Not pre-populating the codes means that only experts will be able to reliably graph or alert on failure rates.

You cannot pre-populate grpc_code since grpc_server_handled_total has grpc_service and grpc_method labels. Are you suggesting populating grpc_service=dummy? That's bad.

If you're talking about making sure that all codes are populated per-request, there isn't a good way to do it. Either you do it every time... or you would require a cache lookup, which would require locking between requests. Both would cause massive performance degradation.

If you feel like there are problems with discoverability of the potential "fixed" values of labels, please add support for this in Prometheus itself by exposing label metadata (and metric metadata) in the server. I'm quite sure that many other people will encounter this problem. Just off the top of my head: HTTP methods, HTTP status codes, etc.

You cannot pre-populate grpc_code since grpc_server_handled_total has grpc_service and grpc_method labels. Are you suggesting populating grpc_service=dummy? That's bad.

I'm proposing that once we know a grpc_service and grpc_method label, we populate the code label for them. grpc_service=dummy wouldn't work in the general case.

If you feel like there are problems with discoverability of potential "fixed" values of labels, please add support for this in Prometheus itself by exposing label metadata (and metric metadata) in the server.

Our solution is to handle this in the client.

Brian, please read my response. Causing locking or excessive map lookups for an undefined benefit of discoverability (there are docs after all) is not the way forward.

Please reconsider adding label metadata if you believe this is a big problem. Otherwise, you can't expect third-party open source contributors to jump through hoops.

The problem here is that gRPC doesn't expose sufficient information to provide reliable metrics, so we have to do the next best thing.

If you want a better way to do things, I'd suggest filing a feature request with gRPC to get access to the list of services and methods.

Causing locking or excessive map lookups for an undefined benefit of discoverability

The problem is not discovery, it is correctness. I've not mentioned label value discovery, so I'm not sure where you got that from.

There are real failure modes. For example, a 100% error ratio would result in an alert on the successful request ratio being too low never firing, leaving the user blind to the 100% failure ratio.

I've seen this type of problem myself many times, and postmortems mentioning this are not unknown.
The problem is that the failure mode is subtle. Everything seems to work fine until it doesn't, and the time it doesn't work is often during a nasty outage.

Please reconsider adding label metadata if you believe this is a big problem. Otherwise, you can't expect third-party open source contributors to jump through hoops.

I don't think plumbing this through to Prometheus is an appropriate solution. This is a client-side problem; we're unlikely to add complexity to Prometheus, the exposition formats, and all the client libraries to save maybe one line of code in the standard case.

The cache lookup would cost about the same as WithLabelValues (around 70ns), of which there are already at least 4 calls for each stream.
(One reason direct instrumentation is better is that we can usually avoid all these concurrent map lookup costs by taking pointers at initialisation.)
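
For context, "taking pointers at initialisation" is the usual client_golang pattern of resolving the labelled child once and keeping it, so the hot path skips the vector's concurrent map lookup. A self-contained sketch with made-up service/method names:

```go
package grpcmetrics

import "github.com/prometheus/client_golang/prometheus"

var handledTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "grpc_server_handled_total",
		Help: "Total RPCs completed on the server.",
	},
	[]string{"grpc_service", "grpc_method", "grpc_code"},
)

// Resolved once: the labelled-child lookup (the ~70ns mentioned above) is
// paid here, at initialisation time, rather than on every request.
var pingEmptyOK = handledTotal.WithLabelValues(
	"example.TestService", "PingEmpty", "OK")

func onPingEmptyOK() {
	// Hot path: an atomic add on the cached child, no map lookup or lock.
	pingEmptyOK.Inc()
}
```

An interceptor can't do this directly, since it only learns the service and method when a request arrives, which is why the per-pair cache above comes up at all.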

Just off the top of my head: HTTP methods, HTTP status codes, etc.

This is unfortunately only a minor implementation detail of the challenges around HTTP.

@brian-brazil, the fact of the matter is: gRPC exposes a decent interface, good enough for other integrations (statsd, logging, auth). The interface is similar to HTTP, which many people use.
If Prometheus requires more, that IMHO is a failure on Prometheus' side.

Expecting a third-party integration library to jump through hoops on a "hot" request path is a desperate solution to a wider problem.

Nevertheless, I raised grpc/grpc-go#689, hoping to get the gRPC interface necessary to work around the Prometheus issue.
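
If gRPC did expose its registrations (something like a GetServiceInfo-style listing of services and methods, which is what grpc/grpc-go#689 asks for), the pre-population could move off the request path entirely. A hedged sketch, reusing the serverHandledTotal vector from the earlier sketch; InitializeMetrics is a hypothetical helper, not this library's API:

```go
package grpcmetrics

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// InitializeMetrics walks every registered service and method once at
// startup and touches each grpc_code series, so all of them exist at 0
// before the first RPC arrives. It assumes grpc.Server exposes its
// registrations via a GetServiceInfo-style API.
func InitializeMetrics(srv *grpc.Server) {
	for serviceName, info := range srv.GetServiceInfo() {
		for _, m := range info.Methods {
			for c := codes.OK; c <= codes.Unauthenticated; c++ {
				serverHandledTotal.WithLabelValues(serviceName, m.Name, c.String()).Add(0)
			}
		}
	}
}
```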

As this affects other systems (HTTP at least) and not only gRPC, let's move the discussion to
prometheus/prometheus#1636 where it belongs.