cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

Metrics show append failures but logs show none

jakubgs opened this issue

Describe the bug
I'm seeing lots of errors in the cortex_distributor_ingester_append_failures_total metric on one of our distributors:

cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="4xx",type="samples"} 318
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="metadata"} 14
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="samples"} 1670
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="4xx",type="samples"} 248
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="metadata"} 13
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="samples"} 1991
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="4xx",type="samples"} 68
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="5xx",type="samples"} 23041
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="4xx",type="samples"} 128
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="metadata"} 44
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="samples"} 5642
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="4xx",type="samples"} 97
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="5xx",type="samples"} 36903

These errors can be seen in the graph:

image

But when I log onto the affected distributor and ingester I cannot find any errors logged, even at debug level:

ts=2024-03-06T09:53:16.116085832Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.272868ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.133388069Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=5.270535ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.146464557Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.954335ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.157419974Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.418061ms msg="gRPC (success)"

To Reproduce
I have no idea.

Expected behavior
Either the distributor or the ingester should show errors in the logs so the issue can actually be debugged.

Environment:
Prometheus 2.50.1 sending to Cortex 1.16.0.

As recommended by @friedrichg, I ran this query for the affected ingester:

histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="ingester",route="/cortex.Ingester/Push"}[5m])) by (le,route,status_code))

But the ingester appeared to be healthy:

image

Then it was recommended to query this for the affected distributor:

histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (le,route,status_code))

And the result was:

image

Which shows 400 errors for push requests. But no 400 errors were logged by the distributor, even at debug log level.

Share this, if you can:

sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)

The only two warnings I managed to find were:

ts=2024-03-06T09:40:15.643029269Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:40:09.183Z, series={__name__=\"consul_consul_members_clients\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-03.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
ts=2024-03-06T09:57:16.640627601Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:57:07.986Z, series={__name__=\"consul_consul_state_nodes\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-01.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"

But that doesn't match the number of append failures in the distributor metric.

sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)

image

I managed to measure some packet loss between the distributor and ingester:

image

But packet loss shouldn't cause 400s; it would cause retransmissions and eventually timeouts.

Not sure if it matters but distributors sometimes print stuff like this:


2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

I cannot find any issues about it in this repo.
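For what it's worth, my understanding is that the "too_many_pings" GoAway is gRPC keepalive enforcement: the client side of the distributor-to-ingester connection is pinging more often than the server allows. If that turns out to be relevant, one possible mitigation (an assumption on my part, not something confirmed here) would be relaxing the keepalive enforcement in the ingester's server block; the option names below are taken from the Cortex server configuration docs, so they should be double-checked against the running version:

server:
  # Minimum interval a client must wait between keepalive pings before the
  # server replies with GoAway "too_many_pings".
  # CLI flag: -server.grpc.keepalive.min-time-between-pings
  grpc_server_min_time_between_pings: '10s'
  # Allow keepalive pings even when there are no active streams (RPCs).
  # CLI flag: -server.grpc.keepalive.ping-without-stream-allowed
  grpc_server_ping_without_stream_allowed: true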

Eventually the cluster explodes and distributor logs are full of this:

ts=2024-03-06T10:22:34.727004737Z caller=logging.go:86 level=warn traceID=300fd752f4382b17
msg="POST /api/v1/push (500) 7.643734ms Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\""

Which honestly doesn't look like a warning-level message; it looks more like a severe, error-level message.

What baffles me, though, is that the ingester instances were at no point overloaded; they are mostly idle:

image

And the load never went above 2.0 on a machine with 8 cores:

image

So why do they suddenly become not "alive"? No idea.

I was asked to run this query:

histogram_quantile(0.99,sum(rate(cortex_kv_request_duration_seconds_bucket[5m])) without (container,job,pod,instance))

Result:

image

The distributor in question was also at no point overloaded, with load below 3 on a host with 8 cores:

image

And plenty of free RAM left.

@jakubgs can you share your distributor config too? thanks

Sure:

---
target: 'distributor'
auth_enabled: false

# ---------------------- Configs --------------------------
configs:
  database:
    uri: 'memory://'

# ---------------------- Limits ---------------------------
limits:
  ingestion_rate: 1000000
  ingestion_burst_size: 2000000
  # Max active metrics with meta per user, per ingester.
  max_metadata_per_user: 128000
  # Limit impact from high cardinality.
  max_series_per_metric: 60000
  max_series_per_user: 5000000
  max_label_names_per_series: 30
  # Delete blocks containing samples older than this.
  compactor_blocks_retention_period: 0
  # Maximum accepted sample age before rejecting.
  reject_old_samples: true
  reject_old_samples_max_age: 10m
  # Allowed time window for ingestion of out-of-order samples.
  out_of_order_time_window: 5m

# ---------------------- Server ---------------------------
server:
  http_listen_address: '0.0.0.0'
  http_listen_port: 9092
  grpc_listen_address: '0.0.0.0'
  grpc_listen_port: 9095
  log_level: 'info'

  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000

# ---------------------- Storage --------------------------
storage:
  engine: 'blocks'

# ---------------------- Distributor ----------------------
distributor:
  # Low values cause `context deadline exceeded` on push to ingesters.
  remote_timeout: '30s'

  ring:
    kvstore:
      store: etcd
      etcd:
        username: cortex
        password: SECRET
        endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']

# ---------------------- Ingester -------------------------
ingester:
  lifecycler:
    tokens_file_path: '/var/tmp/cortex/wal/tokens'
    interface_names: [wg0]

    ring:
      kvstore:
        store: etcd
        etcd:
          username: cortex
          password: SECRET
          endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']
  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000

I have never changed that. I have run clusters of all sizes without ever touching it.
I use https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/distributor.libsonnet#L24-L26 for distributors

I can remove it from the distributor config, but I doubt that has any bearing on the latency.

I have applied your suggested changes to the gRPC server configuration, but the latency issues persist.

From your message, 10.10.0.214:9095, 10.10.0.212:9095, 10.10.0.211:9095, 10.10.0.218:9095, and 10.10.0.213:9095 are the problematic ingesters. It might be worth checking whether it is always the same ingesters in the message.

Those are all my ingesters.

What I also wonder is why distributor hosts are generating 10x the outgoing traffic compared to incoming:

image

How does that happen? Does the distributor receive data compressed and send it to the ingesters uncompressed, or something?

I don't understand: every time ingested samples go above 13k:

image

Ingester append failures (400s) start to rise:

image

And push latency explodes:

image
image

And eventually everything crashes, all while ingesters and distributors are mostly idle:

image

I'm baffled.

And when everything crashes distributor logs are full of this:

ts=2024-03-06T14:53:06.762792796Z caller=logging.go:86 level=warn traceID=33cb0c7912c40196
msg="POST /api/v1/push (500) 102.578691ms Response: \"No metric name label\\n\" ws: false; Content-Encoding: snappy; Content-Length: 51463; Content-Type: application/x-protobuf; Retry-Attempt: 50; User-Agent: Prometheus/2.50.1; X-Forwarded-For: 172.17.2.2; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.17.2.2; "

No idea why. Related issue:

Restarting Prometheus makes it go away. No idea why.

I also get this when everything eventually comes crashing down:

ts=2024-03-06T15:00:20.39654373Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.42643706Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.438756524Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"

What I also wonder is why distributor hosts are generating 10x the outgoing traffic compared to incoming:

That seems to be the case. From your configuration, I don't see gRPC compression enabled on the gRPC client to the ingester: -ingester.client.grpc-compression.

Response: "at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\n"

It seems that your ingesters failed the health check. The reason for the failure could vary. They will be considered unhealthy if they don't heartbeat the ring kvstore (etcd in this case) in time. I recommend checking the ingester ring status.

There is a web UI exposed where you can check whether those ingesters are healthy or not.
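For reference, the heartbeat behaviour is controlled by the lifecycler/ring options on the ingester side; a minimal sketch of the relevant settings (option names as listed in the configuration docs, values are the usual defaults, please verify against your version) looks like this:

ingester:
  lifecycler:
    # How often the ingester heartbeats its entry in the ring kvstore (etcd here).
    # CLI flag: -ingester.heartbeat-period
    heartbeat_period: '5s'
    ring:
      # After this long without a heartbeat, the ingester is considered unhealthy.
      # CLI flag: -ring.heartbeat-timeout
      heartbeat_timeout: '1m'

The ring status page is typically served at /ring on the distributor.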

@yeya24 but I see no compression flag for distributors; these are the only ones I can find:
https://cortexmetrics.io/docs/configuration/configuration-file/

    # Use compression when sending messages. Supported values are: 'gzip',
    # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
    # CLI flag: -query-scheduler.grpc-client-config.grpc-compression
    [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy' and '' (disable compression)
  # CLI flag: -alertmanager.alertmanager-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -querier.frontend-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -ingester.client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy' and '' (disable compression)
  # CLI flag: -querier.store-gateway-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -frontend.grpc-client-config.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -ruler.client.grpc-compression
  [grpc_compression: <string> | default = ""]

Is it just not documented?

You're suggesting to use ingester.client.grpc-compression, but the documentation says:

  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)

But isn't the distributor sending messages to the ingester, not the ingester sending them to the distributor? Confusing.

Also, which type of compression do you recommend?

Oh, I see, the ingester_client configuration is different from the ingester configuration.
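So, if I read the docs correctly, enabling it on the distributor side would look something like this (my interpretation based on the flag name above, not verified beyond that):

ingester_client:
  grpc_client_config:
    # Compress distributor->ingester push traffic.
    # CLI flag: -ingester.client.grpc-compression
    grpc_compression: 'snappy'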

Any good reason why compression isn't enabled by default?

Indeed, enabling compression has reduced outgoing traffic from distributors by 10x, and it is now roughly the same as incoming:

image

This alone might make a big difference.

Any good reason why compression isn't enabled by default?

Compression could cause a CPU increase on both the distributor and the ingester, which might cause some unexpected issues.
We can think about whether we want to turn it on by default.

One thing worth noting that I discovered: we had abysmal WireGuard VPN performance; we use WireGuard to send metrics between hosts in our metrics fleet:

Connecting to host ingest-05.do-ams3.metrics.hq.wg, port 9092
[  5] local 10.10.0.73 port 37178 connected to 10.10.0.218 port 9092
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  9.35 MBytes  78.5 Mbits/sec    0    172 KBytes       
[  5]   1.00-2.00   sec  9.32 MBytes  78.2 Mbits/sec    0    180 KBytes       
[  5]   2.00-3.00   sec  8.83 MBytes  74.1 Mbits/sec    0    180 KBytes       
[  5]   3.00-4.00   sec  8.83 MBytes  74.1 Mbits/sec    0    180 KBytes       
[  5]   4.00-5.00   sec  7.73 MBytes  64.8 Mbits/sec    0    180 KBytes       
[  5]   5.00-6.00   sec  6.99 MBytes  58.6 Mbits/sec    0    180 KBytes       
[  5]   6.00-7.00   sec  7.30 MBytes  61.2 Mbits/sec    0    180 KBytes       
[  5]   7.00-8.00   sec  7.85 MBytes  65.9 Mbits/sec    0    180 KBytes       
[  5]   8.00-9.00   sec  8.03 MBytes  67.4 Mbits/sec    0    180 KBytes       
[  5]   9.00-10.00  sec  8.40 MBytes  70.5 Mbits/sec    0    180 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  82.6 MBytes  69.3 Mbits/sec    0             sender
[  5]   0.00-10.04  sec  82.2 MBytes  68.7 Mbits/sec                  receiver

As compared to performance without WireGuard:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   262 MBytes  2.20 Gbits/sec  5719    144 KBytes       
[  5]   1.00-2.00   sec   240 MBytes  2.01 Gbits/sec  3349    126 KBytes       
[  5]   2.00-3.00   sec   239 MBytes  2.00 Gbits/sec  3490    293 KBytes       
[  5]   3.00-4.00   sec   238 MBytes  1.99 Gbits/sec  9508    126 KBytes       
[  5]   4.00-5.00   sec   238 MBytes  1.99 Gbits/sec  10334    294 KBytes       
[  5]   5.00-6.00   sec   239 MBytes  2.00 Gbits/sec  10028    182 KBytes       
[  5]   6.00-7.00   sec   239 MBytes  2.00 Gbits/sec  8392    226 KBytes       
[  5]   7.00-8.00   sec   232 MBytes  1.95 Gbits/sec  4763    250 KBytes       
[  5]   8.00-9.00   sec   242 MBytes  2.03 Gbits/sec  3059    168 KBytes       
[  5]   9.00-10.00  sec   236 MBytes  1.98 Gbits/sec  3851    228 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.35 GBytes  2.02 Gbits/sec  62493             sender
[  5]   0.00-10.02  sec  2.35 GBytes  2.01 Gbits/sec                  receiver

Which is most probably due to the recent DigitalOcean network maintenance they did in our DC last week.

This suggests to me that the 400 errors distributors were getting from ingesters were timeouts caused not by ingester latency or I/O issues, but purely by the WireGuard VPN throttling the network connection, dropping some packets, and delivering others far too late.

I'm not entirely sure how exactly this was causing 400s and 500s, but it seems like the most probable culprit in this case, as it was definitely not host CPU load or memory usage.

As far as I can tell my issue has been resolved and my cluster is now stable:

image

The fix involved several changes, the biggest being gRPC compression:

  • Removal of the secondary KV store (Consul) as unnecessary (and also slower).
  • Discovery that distributors were using the secondary KV store instead of the primary due to misconfiguration.
  • Fixing the WireGuard VPN setup to avoid unnecessary restarts and instead reload configuration.
  • Fine-tuning the Nginx load balancer config in front of the distributor instances (worker_connections, worker_rlimit_nofile).
  • Enabling gRPC compression for traffic between distributors and ingesters.

I still believe there is an issue that could be fixed in cortex. I believe the main reason for my problems was network bandwidth constraints caused by bad WireGuard performance on the DigitalOcean network after their maintenance work.

For some reason this network performance issue appeared in metrics as 400s and 500s when pushing from distributors to ingesters. This confused me because I could not find any actual 400 and 500 errors in either distributor or ingester logs.
In hindsight it is possible these log messages were indicating the network performance degradation:

2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

But I'm not actually sure. It would be useful to have (or learn about) a metric that could indicate network traffic saturation issues.

I still believe there is an issue that could be fixed in cortex

The distributor already had logs like the one below mentioning unhealthy ingesters, which is an indication of some issue with the connection between ingesters and distributors. In this case, the distributor doesn't even send the request, since there aren't enough live replicas.

  Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\"

That's a partially fair point, but I still don't understand why the graphs were showing thousands of errors while the distributor was not logging any of them. That's a bug. What I also don't get is why I was getting 500s from ingesters when connection saturation was the issue.

It's possible the payload sent to ingesters didn't arrive in its entirety and could not be processed, but in that case the resulting status code should be 422 Unprocessable Content, not 500 Internal Server Error, to indicate that:

the server understands the content type of the request entity, and the syntax of the request entity is correct, but it was unable to process the contained instructions.

And the distributor should be logging them in full, considering cortex_distributor_ingester_append_failures_total only provides 4xx and 5xx as values for the status label, so it's not great for more precise debugging.

What do you think?

@jakubgs I think the distributor still logs the error when the quorum actually fails.
A lot of the time some ingesters may fail to append, but the write can still succeed when a quorum of replicas accepts it.

We could try to log every failed push to an ingester, but it could be excessive.

Yes, I agree that logging every single one would be excessive; it would be cleaner to have more detailed status values in metrics like cortex_distributor_ingester_append_failures_total instead, since if you are making hundreds or thousands of pushes every minute that would fill the log with mostly noise.

But my point about using 422 instead of 500 in the case of partial push delivery still stands. 500 is a generic error you use when you don't know what else to use. If we are more specific with status codes in errors and in metrics, we can make debugging specific failure states easier.