cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

Metrics show append failures but logs show none

jakubgs opened this issue

Describe the bug
I'm seeing lots of errors in the cortex_distributor_ingester_append_failures_total metric on one of our distributors:

cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="4xx",type="samples"} 318
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="metadata"} 14
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="samples"} 1670
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="4xx",type="samples"} 248
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="metadata"} 13
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="samples"} 1991
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="4xx",type="samples"} 68
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="5xx",type="samples"} 23041
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="4xx",type="samples"} 128
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="metadata"} 44
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="samples"} 5642
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="4xx",type="samples"} 97
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="5xx",type="samples"} 36903

These errors can be seen in the graph:

image

But when I log onto the affected distributor and ingester I cannot find any errors logged, even at debug level:

ts=2024-03-06T09:53:16.116085832Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.272868ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.133388069Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=5.270535ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.146464557Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.954335ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.157419974Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.418061ms msg="gRPC (success)"

To Reproduce
I have no idea.

Expected behavior
Either the distributor or the ingester should show errors in the logs so the issue can actually be debugged.

Environment:
Prometheus 2.50.1 sending to Cortex 1.16.0.

As recommended by @friedrichg, I ran this query for the affected ingester:

histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="ingester",route="/cortex.Ingester/Push"}[5m])) by (le,route,status_code))

But the ingester appeared to be healthy:

image

Then it was recommended to query this for the affected distributor:

histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (le,route,status_code))

And the result was:

image

Which shows 400 errors for push requests. But no 400 errors were logged by the distributor, even at debug log level.

Share this, if you can:

sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)

The only two warnings I managed to find were:

ts=2024-03-06T09:40:15.643029269Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:40:09.183Z, series={__name__=\"consul_consul_members_clients\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-03.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
ts=2024-03-06T09:57:16.640627601Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:57:07.986Z, series={__name__=\"consul_consul_state_nodes\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-01.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"

But that doesn't match the number of append failures in the distributor metric.

sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)

image

I managed to measure some packet loss between the distributor and ingester:

image

But packet loss shouldn't cause 400s; it would cause retransmissions and eventually timeouts.

Not sure if it matters but distributors sometimes print stuff like this:


2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

I cannot find any issues about it in this repo.
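For what it's worth, my understanding is that the "too_many_pings" GoAway is gRPC keepalive enforcement: the client side of the distributor-to-ingester connection is pinging more often than the server allows. If that turns out to be relevant, one possible mitigation (an assumption on my part, not something confirmed here) would be relaxing the keepalive enforcement in the ingester's server block; the option names below are taken from the Cortex server configuration docs, so they should be double-checked against the running version:

server:
  # Minimum interval a client must wait between keepalive pings before the
  # server replies with GoAway "too_many_pings".
  # CLI flag: -server.grpc.keepalive.min-time-between-pings
  grpc_server_min_time_between_pings: '10s'
  # Allow keepalive pings even when there are no active streams (RPCs).
  # CLI flag: -server.grpc.keepalive.ping-without-stream-allowed
  grpc_server_ping_without_stream_allowed: true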

Eventually the cluster explodes and distributor logs are full of this:

ts=2024-03-06T10:22:34.727004737Z caller=logging.go:86 level=warn traceID=300fd752f4382b17
msg="POST /api/v1/push (500) 7.643734ms Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\""

Which honestly doesn't look like a warning-level message; it looks more like a severe, error-level message.

What baffles me, though, is that the ingester instances were at no point overloaded; they are mostly idle:

image

And the load never went above 2.0 on a machine with 8 cores:

image

So why do they suddenly become not "alive"? No idea.

I was asked to run this query:

histogram_quantile(0.99,sum(rate(cortex_kv_request_duration_seconds_bucket[5m])) without (container,job,pod,instance))

Result:

image

The distributor in question was also at no point overloaded, with load below 3 on a host with 8 cores:

image

And plenty of free RAM left.

@jakubgs can you share your distributor config too? thanks

Sure:

---
target: 'distributor'
auth_enabled: false

# ---------------------- Configs --------------------------
configs:
  database:
    uri: 'memory://'

# ---------------------- Limits ---------------------------
limits:
  ingestion_rate: 1000000
  ingestion_burst_size: 2000000
  # Max active metrics with meta per user, per ingester.
  max_metadata_per_user: 128000
  # Limit impact from high cardinality.
  max_series_per_metric: 60000
  max_series_per_user: 5000000
  max_label_names_per_series: 30
  # Delete blocks containing samples older than this.
  compactor_blocks_retention_period: 0
  # Maximum accepted sample age before rejecting.
  reject_old_samples: true
  reject_old_samples_max_age: 10m
  # Allowed time window for ingestion of out-of-order samples.
  out_of_order_time_window: 5m

# ---------------------- Server ---------------------------
server:
  http_listen_address: '0.0.0.0'
  http_listen_port: 9092
  grpc_listen_address: '0.0.0.0'
  grpc_listen_port: 9095
  log_level: 'info'

  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000

# ---------------------- Storage --------------------------
storage:
  engine: 'blocks'

# ---------------------- Distributor ----------------------
distributor:
  # Low values cause `context deadline exceeded` on push to ingesters.
  remote_timeout: '30s'

  ring:
    kvstore:
      store: etcd
      etcd:
        username: cortex
        password: SECRET
        endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']

# ---------------------- Ingester -------------------------
ingester:
  lifecycler:
    tokens_file_path: '/var/tmp/cortex/wal/tokens'
    interface_names: [wg0]

    ring:
      kvstore:
        store: etcd
        etcd:
          username: cortex
          password: SECRET
          endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']
  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000

I have never changed that. I have run clusters of all sizes without ever touching it.
I use https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/distributor.libsonnet#L24-L26 for distributors

I can remove it from the distributor config, but I doubt that has any bearing on the latency.

I have applied your suggested changes to the gRPC server configuration, but the latency issues persist.

From your message, 10.10.0.214:9095, 10.10.0.212:9095, 10.10.0.211:9095, 10.10.0.218:9095, and 10.10.0.213:9095 are the problematic ingesters. It might be worth checking whether it is always the same ingesters in the message.

Those are all my ingesters.

What I also wonder is why distributor hosts are generating 10x the outgoing traffic compared to incoming:

image

How does that happen? Does the distributor receive data compressed and send it to the ingesters uncompressed, or something?

I don't understand: every time ingested samples go above 13k:

image

Ingester append failures (400s) start to rise:

image

And push latency explodes:

image
image

And eventually everything crashes, all while ingesters and distributors are mostly idle:

image

I'm baffled.

And when everything crashes distributor logs are full of this:

ts=2024-03-06T14:53:06.762792796Z caller=logging.go:86 level=warn traceID=33cb0c7912c40196
msg="POST /api/v1/push (500) 102.578691ms Response: \"No metric name label\\n\" ws: false; Content-Encoding: snappy; Content-Length: 51463; Content-Type: application/x-protobuf; Retry-Attempt: 50; User-Agent: Prometheus/2.50.1; X-Forwarded-For: 172.17.2.2; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.17.2.2; "

No idea why. Related issue:

Restarting Prometheus makes it go away. No idea why.

I also get this when everything eventually comes crashing down:

ts=2024-03-06T15:00:20.39654373Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.42643706Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.438756524Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"

What I also wonder is why distributor hosts are generating 10x the outgoing traffic compared to incoming:

That seems to be the case. From your configuration, I don't see gRPC compression enabled on the gRPC client to the ingester: -ingester.client.grpc-compression.

Response: "at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\n"

It seems that your ingesters failed the health check. The reason for the failure could vary. They will be considered unhealthy if they don't heartbeat the ring kvstore (etcd in this case) in time. I recommend checking the ingester ring status.

There is a web UI exposed where you can check whether those ingesters are healthy or not.
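For reference, the heartbeat behaviour is controlled by the lifecycler/ring options on the ingester side; a minimal sketch of the relevant settings (option names as listed in the configuration docs, values are the usual defaults, please verify against your version) looks like this:

ingester:
  lifecycler:
    # How often the ingester heartbeats its entry in the ring kvstore (etcd here).
    # CLI flag: -ingester.heartbeat-period
    heartbeat_period: '5s'
    ring:
      # After this long without a heartbeat, the ingester is considered unhealthy.
      # CLI flag: -ring.heartbeat-timeout
      heartbeat_timeout: '1m'

The ring status page is typically served at /ring on the distributor.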

@yeya24 but I see no compression flag for distributors; these are the only ones I can find:
https://cortexmetrics.io/docs/configuration/configuration-file/

    # Use compression when sending messages. Supported values are: 'gzip',
    # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
    # CLI flag: -query-scheduler.grpc-client-config.grpc-compression
    [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy' and '' (disable compression)
  # CLI flag: -alertmanager.alertmanager-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -querier.frontend-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -ingester.client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy' and '' (disable compression)
  # CLI flag: -querier.store-gateway-client.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -frontend.grpc-client-config.grpc-compression
  [grpc_compression: <string> | default = ""]
  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
  # CLI flag: -ruler.client.grpc-compression
  [grpc_compression: <string> | default = ""]

Is it just not documented?

You're suggesting to use ingester.client.grpc-compression, but the documentation says:

  # Use compression when sending messages. Supported values are: 'gzip',
  # 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)

But isn't the distributor sending messages to the ingester, not the ingester sending them to the distributor? Confusing.

Also, which type of compression do you recommend?

Oh, I see, the ingester_client configuration is different from the ingester configuration.
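So, if I read the docs correctly, enabling it on the distributor side would look something like this (my interpretation based on the flag name above, not verified beyond that):

ingester_client:
  grpc_client_config:
    # Compress distributor->ingester push traffic.
    # CLI flag: -ingester.client.grpc-compression
    grpc_compression: 'snappy'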

Any good reason why compression isn't enabled by default?

Indeed, enabling compression has reduced outgoing traffic from distributors by 10x, and it is now roughly the same as incoming:

image

This alone might make a big difference.

Any good reason why compression isn't enabled by default?

Compression could cause a CPU increase on both the distributor and the ingester, which might cause some unexpected issues.
We can think about whether we want to turn it on by default.

One thing worth noting that I discovered: we had abysmal WireGuard VPN performance; we use WireGuard to send metrics between hosts in our metrics fleet:

Connecting to host ingest-05.do-ams3.metrics.hq.wg, port 9092
[  5] local 10.10.0.73 port 37178 connected to 10.10.0.218 port 9092
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  9.35 MBytes  78.5 Mbits/sec    0    172 KBytes       
[  5]   1.00-2.00   sec  9.32 MBytes  78.2 Mbits/sec    0    180 KBytes       
[  5]   2.00-3.00   sec  8.83 MBytes  74.1 Mbits/sec    0    180 KBytes       
[  5]   3.00-4.00   sec  8.83 MBytes  74.1 Mbits/sec    0    180 KBytes       
[  5]   4.00-5.00   sec  7.73 MBytes  64.8 Mbits/sec    0    180 KBytes       
[  5]   5.00-6.00   sec  6.99 MBytes  58.6 Mbits/sec    0    180 KBytes       
[  5]   6.00-7.00   sec  7.30 MBytes  61.2 Mbits/sec    0    180 KBytes       
[  5]   7.00-8.00   sec  7.85 MBytes  65.9 Mbits/sec    0    180 KBytes       
[  5]   8.00-9.00   sec  8.03 MBytes  67.4 Mbits/sec    0    180 KBytes       
[  5]   9.00-10.00  sec  8.40 MBytes  70.5 Mbits/sec    0    180 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  82.6 MBytes  69.3 Mbits/sec    0             sender
[  5]   0.00-10.04  sec  82.2 MBytes  68.7 Mbits/sec                  receiver

As compared to performance without WireGuard:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   262 MBytes  2.20 Gbits/sec  5719    144 KBytes       
[  5]   1.00-2.00   sec   240 MBytes  2.01 Gbits/sec  3349    126 KBytes       
[  5]   2.00-3.00   sec   239 MBytes  2.00 Gbits/sec  3490    293 KBytes       
[  5]   3.00-4.00   sec   238 MBytes  1.99 Gbits/sec  9508    126 KBytes       
[  5]   4.00-5.00   sec   238 MBytes  1.99 Gbits/sec  10334    294 KBytes       
[  5]   5.00-6.00   sec   239 MBytes  2.00 Gbits/sec  10028    182 KBytes       
[  5]   6.00-7.00   sec   239 MBytes  2.00 Gbits/sec  8392    226 KBytes       
[  5]   7.00-8.00   sec   232 MBytes  1.95 Gbits/sec  4763    250 KBytes       
[  5]   8.00-9.00   sec   242 MBytes  2.03 Gbits/sec  3059    168 KBytes       
[  5]   9.00-10.00  sec   236 MBytes  1.98 Gbits/sec  3851    228 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.35 GBytes  2.02 Gbits/sec  62493             sender
[  5]   0.00-10.02  sec  2.35 GBytes  2.01 Gbits/sec                  receiver

Which is most probably due to the recent DigitalOcean network maintenance they did in our DC last week.

This suggests to me that the 400 errors distributors were getting from ingesters were timeouts caused not by ingester latency or I/O issues, but purely by the WireGuard VPN throttling the network connection, dropping some packets, and delivering others far too late.

I'm not entirely sure how exactly this was causing 400s and 500s, but it seems like the most probable culprit in this case, as it was definitely not host CPU load or memory usage.

As far as I can tell my issue has been resolved and my cluster is now stable:

image

The fix involved several changes, the biggest being gRPC compression:

  • Removal of the secondary KV store (Consul) as unnecessary (and also slower).
  • Discovery that distributors were using the secondary KV store instead of the primary due to misconfiguration.
  • Fixing the WireGuard VPN setup to avoid unnecessary restarts and instead reload configuration.
  • Fine-tuning the Nginx load balancer config in front of the distributor instances (worker_connections, worker_rlimit_nofile).
  • Enabling gRPC compression for traffic between distributors and ingesters.

I still believe there is an issue that could be fixed in cortex. I believe the main reason for my problems was network bandwidth constraints caused by bad WireGuard performance on the DigitalOcean network after their maintenance work.

For some reason this network performance issue appeared in metrics as 400s and 500s when pushing from distributors to ingesters. This confused me because I could not find any actual 400 and 500 errors in either distributor or ingester logs.
In hindsight it is possible these log messages were indicating the network performance degradation:

2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

But I'm not actually sure. It would be useful to have (or learn about) a metric that could indicate network traffic saturation issues.

I still believe there is an issue that could be fixed in cortex

The distributor already had logs like the one below mentioning unhealthy ingesters, which is an indication of some issue with the connection between ingesters and distributors. In this case, the distributor doesn't even send the request, since there aren't enough live replicas.

  Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\"

That's a partially fair point, but I still don't understand why the graphs were showing thousands of errors while the distributor was not logging any of them. That's a bug. What I also don't get is why I was getting 500s from ingesters when connection saturation was the issue.

It's possible the payload sent to ingesters didn't arrive in its entirety and could not be processed, but in that case the resulting status code should be 422 Unprocessable Content, not 500 Internal Server Error, to indicate that:

the server understands the content type of the request entity, and the syntax of the request entity is correct, but it was unable to process the contained instructions.

And the distributor should be logging them in full, considering cortex_distributor_ingester_append_failures_total only provides 4xx and 5xx as values for the status label, so it's not great for more precise debugging.

What do you think?

@jakubgs I think the distributor still logs the error when the quorum actually fails.
A lot of the time some ingesters may fail to append, but the write can still succeed when a quorum of replicas accepts it.

We could try to log every failed push to an ingester, but it could be excessive.

Yes, I agree that logging every single one would be excessive; it would be cleaner to have more detailed status values in metrics like cortex_distributor_ingester_append_failures_total instead, since if you are making hundreds or thousands of pushes every minute that would fill the log with mostly noise.

But my point about using 422 instead of 500 in the case of partial push delivery still stands. 500 is a generic error you use when you don't know what else to use. If we are more specific with status codes in errors and in metrics, we can make debugging specific failure states easier.