Metrics show append failures but logs show none
jakubgs opened this issue
Describe the bug
I'm seeing lots of errors from cortex_distributor_ingester_append_failures_total
metric on one of our distributors:
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="4xx",type="samples"} 318
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="metadata"} 14
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.211:9095",status="5xx",type="samples"} 1670
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="4xx",type="samples"} 248
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="metadata"} 13
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.212:9095",status="5xx",type="samples"} 1991
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="4xx",type="samples"} 68
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.213:9095",status="5xx",type="samples"} 23041
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="4xx",type="samples"} 128
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="metadata"} 44
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.214:9095",status="5xx",type="samples"} 5642
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="4xx",type="samples"} 97
cortex_distributor_ingester_append_failures_total{ingester="10.10.0.218:9095",status="5xx",type="samples"} 36903
These errors can be seen in the graph:
But when I log onto the affected distributor and ingester I cannot find any errors logged, even at debug level:
ts=2024-03-06T09:53:16.116085832Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.272868ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.133388069Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=5.270535ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.146464557Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.954335ms msg="gRPC (success)"
ts=2024-03-06T09:53:16.157419974Z caller=grpc_logging.go:46 level=debug method=/cortex.Ingester/Push duration=4.418061ms msg="gRPC (success)"
To Reproduce
I have no idea.
Expected behavior
Either the Distributor or the Ingester should show errors in the logs so the issue can actually be debugged.
Environment:
Prometheus 2.50.1
sending to Cortex 1.16.0
As recommended by @friedrichg I made this query for affected ingester:
histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="ingester",route="/cortex.Ingester/Push"}[5m])) by (le,route,status_code))
But the ingester appeared to be healthy:
Then it was recommended to query this for the affected distributor:
histogram_quantile(0.99,sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (le,route,status_code))
And the result was:
Which shows 400 errors for push requests. But no 400 errors were logged by the distributor, even with the log level set to debug.
If you can, also share the result of:
sum(rate(cortex_request_duration_seconds_bucket{container="distributor"}[5m])) by (route,status_code)
The only two warnings I managed to find were:
ts=2024-03-06T09:40:15.643029269Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:40:09.183Z, series={__name__=\"consul_consul_members_clients\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-03.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
ts=2024-03-06T09:57:16.640627601Z caller=push.go:53 level=warn org_id=fake msg="push refused" err="rpc error: code = Code(400) desc = maxFailure (quorum) on a given error family, rpc error: code = Code(400) desc = addr=10.10.0.211:9095 state=ACTIVE zone=, rpc error: code = Code(400) desc = user=fake: err: duplicate sample for timestamp. timestamp=2024-03-06T09:57:07.986Z, series={__name__=\"consul_consul_state_nodes\", datacenter=\"aws-eu-central-1a\", fleet=\"consul.hq\", group=\",consul.hq,metrics,consul,\", instance=\"node-01.aws-eu-central-1a.consul.hq\", job=\"consul-metrics\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}"
But that doesn't match the number of append failures in the distributor metric.
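To quantify that mismatch, a rate form of the same counter (the 5m window is an arbitrary choice) shows how many append failures per second each ingester accumulates, which can be compared against the mere handful of logged warnings:

```promql
sum by (ingester, status, type) (
  rate(cortex_distributor_ingester_append_failures_total[5m])
)
```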
Not sure if it matters but distributors sometimes print stuff like this:
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
2024/03/06 10:16:30 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Cannot find any issues about it in this repo.
Eventually the cluster explodes and distributor logs are full of this:
ts=2024-03-06T10:22:34.727004737Z caller=logging.go:86 level=warn traceID=300fd752f4382b17 msg="POST /api/v1/push (500) 7.643734ms Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\""
Which honestly doesn't look like a warning-level message and more like a severe error-level message.
What baffles me though is that the ingester instances were at no point overloaded; they are mostly idle:
And the load never went above 2.0 on a machine with 8 cores:
So why does it suddenly become not "alive"? No idea.
@jakubgs can you share your distributor config too? thanks
Sure:
---
target: 'distributor'
auth_enabled: false
# ---------------------- Configs --------------------------
configs:
  database:
    uri: 'memory://'
# ---------------------- Limits ---------------------------
limits:
  ingestion_rate: 1000000
  ingestion_burst_size: 2000000
  # Max active metrics with meta per user, per ingester.
  max_metadata_per_user: 128000
  # Limit impact from high cardinality.
  max_series_per_metric: 60000
  max_series_per_user: 5000000
  max_label_names_per_series: 30
  # Delete blocks containing samples older than this.
  compactor_blocks_retention_period: 0
  # Maximum accepted sample age before rejecting.
  reject_old_samples: true
  reject_old_samples_max_age: 10m
  # Allowed time window for ingestion of out-of-order samples.
  out_of_order_time_window: 5m
# ---------------------- Server ---------------------------
server:
  http_listen_address: '0.0.0.0'
  http_listen_port: 9092
  grpc_listen_address: '0.0.0.0'
  grpc_listen_port: 9095
  log_level: 'info'
  # Big queries need bigger message size.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 16777216
  # Bump gRPC concurrency to avoid delays.
  grpc_server_max_concurrent_streams: 1000
# ---------------------- Storage --------------------------
storage:
  engine: 'blocks'
# ---------------------- Distributor ----------------------
distributor:
  # Low values cause `context deadline exceeded` on push to ingesters.
  remote_timeout: '30s'
  ring:
    kvstore:
      store: etcd
      etcd:
        username: cortex
        password: SECRET
        endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']
# ---------------------- Ingester -------------------------
ingester:
  lifecycler:
    tokens_file_path: '/var/tmp/cortex/wal/tokens'
    interface_names: [wg0]
    ring:
      kvstore:
        store: etcd
        etcd:
          username: cortex
          password: SECRET
          endpoints: ['10.10.0.13:2379', '10.10.0.14:2379', '10.10.0.15:2379']
# Big queries need bigger message size.
grpc_server_max_recv_msg_size: 104857600
grpc_server_max_send_msg_size: 16777216
# Bump gRPC concurrency to avoid delays.
grpc_server_max_concurrent_streams: 1000
I have never changed that. I have run clusters of all sizes without ever touching it.
I use https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/distributor.libsonnet#L24-L26 for distributors
It's needed for ingesters though https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex/ingester.libsonnet#L34-L36
I can remove it from distributors config, but I doubt that has any bearing on the latency.
I have applied your suggested changes to GRPC server configuration, but the latency issues persist.
From your message, 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095 are the problematic ingesters. It might be worth checking whether it is always the same ingesters in the message.
Those are all my ingesters.
And when everything crashes distributor logs are full of this:
ts=2024-03-06T14:53:06.762792796Z caller=logging.go:86 level=warn traceID=33cb0c7912c40196 msg="POST /api/v1/push (500) 102.578691ms Response: \"No metric name label\\n\" ws: false; Content-Encoding: snappy; Content-Length: 51463; Content-Type: application/x-protobuf; Retry-Attempt: 50; User-Agent: Prometheus/2.50.1; X-Forwarded-For: 172.17.2.2; X-Forwarded-Proto: http; X-Prometheus-Remote-Write-Version: 0.1.0; X-Real-Ip: 172.17.2.2; "
No idea why. Related issue:
Restarting Prometheus makes it go away. No idea why.
I also get this when eventually all comes crashing down:
ts=2024-03-06T15:00:20.39654373Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.42643706Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
ts=2024-03-06T15:00:20.438756524Z caller=push.go:34 level=error org_id=fake err="unexpected EOF"
What I also wonder is why are distributor hosts generating 10x the amount of outgoing traffic compared to incoming:
That seems to be the case. From your configuration, I don't see that you enable gRPC compression on the gRPC client to the ingester: -ingester.client.grpc-compression
Response: "at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\n"
It seems that your Ingesters failed the health check. The reason for the failure can vary: they are considered unhealthy if they don't heartbeat to the Ring kvstore (etcd in this case) in time. I recommend checking the Ingester Ring status.
There is a web UI exposed and you can look at whether those ingesters are healthy or not.
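For what it's worth, the "in time" part is governed by the ring heartbeat settings. A sketch of the relevant knobs (field names taken from the Cortex configuration reference; the values shown are, as I understand them, the defaults):

```yaml
# Sketch only: an ingester is considered unhealthy once its last ring
# heartbeat is older than heartbeat_timeout.
ingester:
  lifecycler:
    heartbeat_period: 5s       # how often the ingester heartbeats the ring
    ring:
      heartbeat_timeout: 1m    # staleness threshold before "unhealthy"
```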
@yeya24 but I see no compression flag for distributors, these are the only ones I can find:
https://cortexmetrics.io/docs/configuration/configuration-file/
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -query-scheduler.grpc-client-config.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy' and '' (disable compression)
# CLI flag: -alertmanager.alertmanager-client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -querier.frontend-client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -ingester.client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy' and '' (disable compression)
# CLI flag: -querier.store-gateway-client.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -frontend.grpc-client-config.grpc-compression
[grpc_compression: <string> | default = ""]
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
# CLI flag: -ruler.client.grpc-compression
[grpc_compression: <string> | default = ""]
Is it just not documented?
You're suggesting to use ingester.client.grpc-compression, but the documentation says:
# Use compression when sending messages. Supported values are: 'gzip',
# 'snappy', 'snappy-block' ,'zstd' and '' (disable compression)
But isn't the distributor sending messages to the ingester, rather than the ingester sending them to the distributor? Confusing.
Also, which type of compression do you recommend?
Oh, I see, the ingester_client configuration is different from the ingester configuration.
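In config-file form (rather than the CLI flag), enabling it would presumably look something like this; the `ingester_client` block and nesting are inferred from the `-ingester.client.*` flag prefix, so treat this as a sketch to verify against the configuration reference:

```yaml
# Sketch: enable compression on the distributor -> ingester gRPC client.
ingester_client:
  grpc_client_config:
    grpc_compression: 'snappy-block'
```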
Any good reason why compression isn't enabled by default?
Any good reason why compression isn't enabled by default?
Compression could cause CPU increase on both distributor and ingester, which might cause some unexpected issue.
We can think about if we want to turn it on by default.
One thing worth noting that I discovered was that we had abysmal WireGuard VPN performance, which we use to send metrics between hosts in our metrics fleet:
Connecting to host ingest-05.do-ams3.metrics.hq.wg, port 9092
[ 5] local 10.10.0.73 port 37178 connected to 10.10.0.218 port 9092
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 9.35 MBytes 78.5 Mbits/sec 0 172 KBytes
[ 5] 1.00-2.00 sec 9.32 MBytes 78.2 Mbits/sec 0 180 KBytes
[ 5] 2.00-3.00 sec 8.83 MBytes 74.1 Mbits/sec 0 180 KBytes
[ 5] 3.00-4.00 sec 8.83 MBytes 74.1 Mbits/sec 0 180 KBytes
[ 5] 4.00-5.00 sec 7.73 MBytes 64.8 Mbits/sec 0 180 KBytes
[ 5] 5.00-6.00 sec 6.99 MBytes 58.6 Mbits/sec 0 180 KBytes
[ 5] 6.00-7.00 sec 7.30 MBytes 61.2 Mbits/sec 0 180 KBytes
[ 5] 7.00-8.00 sec 7.85 MBytes 65.9 Mbits/sec 0 180 KBytes
[ 5] 8.00-9.00 sec 8.03 MBytes 67.4 Mbits/sec 0 180 KBytes
[ 5] 9.00-10.00 sec 8.40 MBytes 70.5 Mbits/sec 0 180 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 82.6 MBytes 69.3 Mbits/sec 0 sender
[ 5] 0.00-10.04 sec 82.2 MBytes 68.7 Mbits/sec receiver
As compared to performance without WireGuard:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 262 MBytes 2.20 Gbits/sec 5719 144 KBytes
[ 5] 1.00-2.00 sec 240 MBytes 2.01 Gbits/sec 3349 126 KBytes
[ 5] 2.00-3.00 sec 239 MBytes 2.00 Gbits/sec 3490 293 KBytes
[ 5] 3.00-4.00 sec 238 MBytes 1.99 Gbits/sec 9508 126 KBytes
[ 5] 4.00-5.00 sec 238 MBytes 1.99 Gbits/sec 10334 294 KBytes
[ 5] 5.00-6.00 sec 239 MBytes 2.00 Gbits/sec 10028 182 KBytes
[ 5] 6.00-7.00 sec 239 MBytes 2.00 Gbits/sec 8392 226 KBytes
[ 5] 7.00-8.00 sec 232 MBytes 1.95 Gbits/sec 4763 250 KBytes
[ 5] 8.00-9.00 sec 242 MBytes 2.03 Gbits/sec 3059 168 KBytes
[ 5] 9.00-10.00 sec 236 MBytes 1.98 Gbits/sec 3851 228 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.35 GBytes 2.02 Gbits/sec 62493 sender
[ 5] 0.00-10.02 sec 2.35 GBytes 2.01 Gbits/sec receiver
Which is most probably due to the most recent DigitalOcean network maintenance they did in our DC last week.
This suggests to me that the 400 errors distributors were getting from ingesters were timeouts caused not by ingester latency or I/O issues, but purely by the WireGuard VPN throttling the network connection, dropping some packets, and delivering others way too late.
I'm not entirely sure how exactly this was causing 400s and 500s, but it seems like the most probable culprit in this case, as it was definitely not host CPU load or memory usage.
As far as I can tell my issue has been resolved and my cluster is now stable:
The fix involved several changes, the biggest one being gRPC compression:
- Removal of the secondary KV store (Consul) as unnecessary (and also slower).
- Discovery that distributors were using the secondary KV store instead of the primary due to mis-configuration.
- Fixing of the WireGuard VPN setup to avoid unnecessary restarts and instead reload configuration.
- Fine-tuning of the Nginx load balancer config in front of distributor instances (worker_connections, worker_rlimit_nofile).
- Enabling of gRPC compression for traffic between distributors and ingesters.
I still believe there is an issue that could be fixed in cortex. I believe the main reason for my issues was network bandwidth constraints caused by bad WireGuard performance on DigitalOcean network after their maintenance work.
For some reason this network performance issue appeared in metrics as 400s and 500s when pushing from distributors to ingesters. This confused me because I could not find any actual 400 and 500 errors in either distributor or ingester logs.
In hindsight it is possible these log messages were indicating the network performance degradation:
2024/03/06 10:13:50 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
But I'm not actually sure. It would be useful to have(or learn about) a metric that could indicate network traffic saturation issues.
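Assuming node_exporter runs on these hosts, per-interface throughput on wg0 (the interface from `interface_names` in the config above) is one such signal; it can at least show when the tunnel approaches the ~70 Mbit/s ceiling measured with iperf3:

```promql
# Transmit throughput in bits/s on the WireGuard interface.
rate(node_network_transmit_bytes_total{device="wg0"}[5m]) * 8
```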
I still believe there is an issue that could be fixed in cortex
The Distributor already had logs like the one below reporting unhealthy Ingesters, which is an indication of some issue with the connection between Ingesters and Distributors. In this case, the Distributor doesn't even send the request, since there are not enough live replicas.
Response: \"at least 2 live replicas required, could only find 0 - unhealthy instances: 10.10.0.214:9095,10.10.0.212:9095,10.10.0.211:9095,10.10.0.218:9095,10.10.0.213:9095\\n\"
That's a partially fair point, but I still don't understand why the graphs were showing thousands of errors while the Distributor was not logging any of them. That's a bug. What I also don't get is why I was getting 500s from ingesters when connection saturation was the issue.
It's possible the payload sent to ingesters didn't arrive in its entirety and could not be processed, but in that case the resulting status code should be 422 Unprocessable Content and not 500 Internal Server Error, to indicate that:
the server understands the content type of the request entity, and the syntax of the request entity is correct, but it was unable to process the contained instructions.
And the Distributor should be logging them in full, considering cortex_distributor_ingester_append_failures_total only provides 4xx and 5xx as values for the status label, so it's not great for more exact debugging.
What do you think?
@jakubgs I think Distributor still logs the error when the quorum actually fails.
A lot of the time some Ingesters may fail to append, but the write can still succeed when quorum replicas accept it.
We can try to log every failed push to Ingester, but it can be excessive.
Yes, logging every single one would be excessive, I agree. It would be cleaner to have more detailed status values in metrics like cortex_distributor_ingester_append_failures_total instead, since if you are making hundreds or thousands of pushes every minute that would fill the log with mostly noise.
But my point about using 422 instead of 500 in case of partial push delivery still stands. 500 is a generic error that you use when you don't know what else to use. If we are more specific with status codes in errors and in metrics, we can make debugging specific failure states easier.