cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page:https://cortexmetrics.io/

Cortex 1.16.0 Upgrade Error: LabelValues() from merge generic querier for label

dpericaxon opened this issue

Describe the bug
After upgrading from Cortex 1.15.3 to 1.16.0 we started seeing errors like these on our ingesters:

caller=grpc_logging.go:43 level=warn duration=315.604µs method=/cortex.Ingester/LabelValues err="LabelValues() from merge generic querier for label DatabaseName: block: redact: fetching postings for matchers: context canceled" msg=gRPC
caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/LabelValues duration=117.601µs err="LabelValues() from merge generic querier for label type: block: redact: fetching postings for matchers: context canceled" msg=gRPC

I checked the Queriers and don't see any errors, all I see are 200's.

On the distributors we started seeing these errors:

caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=redact:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp

The err: duplicate sample for timestamp is familiar to us but the beginning part of that log line related to maxFailure (quorum) is new. Are these expected?

We do see some of these happening on store-gateway:

"rpc error: code = Aborted desc = fetch postings for block redact: expanded matching posting: fetch and expand postings: get postings: reading postings: context canceled"

We aren't sure whether these messages are just noise or whether they come from cancelled requests, which would explain the error.

To Reproduce
Steps to reproduce the behavior:

  1. Upgrade Cortex from 1.15.3 to 1.16.0

Expected behavior
A clear and concise description of what you expected to happen.

Environment:

  • Infrastructure: AKS|K8 Version: 1.26.x
  • Deployment tool: Cortex Helm Chart

Additional Context

A "context canceled" error comes from either a timeout or a request that was canceled by the caller. If requests are being canceled, that should be fine. If they are timing out, you probably need to check your service.

caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=redact:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp

maxFailure (quorum) is also expected.

@yeya24 we haven't been able to figure out where the requests are coming from, because when we look at the querier and ruler we don't see timeouts that correlate with the error above. Are there other places we should look to track it down?

We decided to test rolling back, and the errors stopped once we were back on v1.15.3.

Are you seeing some impact?

@yeya24 could this be because we fixed some context cancellation propagation and now we are actually canceling the queries that time out? (If so, this is actually a good thing?)

I think to prove that this may be the case, could you keep the ingester on 1.16 and all other query components on 1.15.3 (queriers, query-frontend and, if used, query-scheduler)?

Hey @alanprot I was able to bump the ingesters up to v1.16.0 but we still see these in the logs:
ts=2023-12-15T14:26:27.753624641Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/LabelValues duration=1.341282224s err="LabelValues() from merge generic querier for label instance: block: 01HredactR11R: fetching postings for matchers: context canceled" msg=gRPC
ts=2023-12-15T14:26:39.726258409Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/LabelValues duration=47.266176ms err="LabelValues() from merge generic querier for label type: block: 01HHredactV72C: fetching postings for matchers: context canceled" msg=gRPC
ts=2023-12-15T14:26:55.57439212Z caller=grpc_logging.go:64 level=warn method=/cortex.Ingester/QueryStream duration=154.802µs err="block: 01HHredactP6B: get postings offset entry: context canceled" msg=gRPC
ts=2023-12-15T14:27:12.000973732Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/LabelValues duration=44.081753ms err="LabelValues() from merge generic querier for label type: block: 01HHredactV72C: fetching postings for matchers: context canceled" msg=gRPC
ts=2023-12-15T14:27:12.068824183Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/LabelValues duration=4.919062ms err="LabelValues() from merge generic querier for label type: block: 01HHPredactC09V72C: fetching postings for matchers: context canceled" msg=gRPC
ts=2023-12-15T14:27:12.160218629Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/LabelValues duration=16.558308ms err="LabelValues() from merge generic querier for label type: block: 01Hredact9V72C: fetching values of label type: context canceled" msg=gRPC
ts=2023-12-15T14:27:12.33095397Z caller=grpc_logging.go:43 level=warn method=/cortex.Ingester/LabelValues duration=96.702µs err="LabelValues() from merge generic querier for label instance_name: fetching postings for matchers: context canceled" msg=gRPC

It seems like it continuously repeats. We haven't seen any impact or heard from users about any. It's just new log errors we haven't seen before, and we don't want to move to production in case there is some impact that we don't understand.

Hey @alanprot @yeya24 I wanted to check and see if you had any other thoughts or anything else I should try testing?

@dpericaxon Did you see any real availability drop on your query side?

I think it is expected because we have quorum and some requests are canceled anyway.

@yeya24 we updated in a few environments and didn't notice any issues currently. So we might be good to close this! Thank you for all of your help!