cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

`Terminated` state results in unhealthy ingesters

mkieweg opened this issue

Describe the bug
When ingesters are shut down, they enter the `Terminated` state. Memberlist treats this state as unexpected, so the heartbeat fails and the instance is marked unhealthy. Recovering requires manual intervention, which effectively breaks autoscaling.

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex v1.15.3 using Helm chart v2.1.0
  2. Use HPA to scale down Cortex ingesters

Expected behavior
Ingesters should scale down and remove themselves from the ring without errors.

Environment:

  • Infrastructure: EKS
  • Deployment tool: Helm chart v2.1.0

Additional Context

Logs

{"caller":"logging.go:76","level":"debug","msg":"GET //ingester/shutdown (301) 73.436µs","traceID":"1bf635dc8c6c3d4e","ts":"2023-11-21T16:41:39.79265651Z"}
{"caller":"lifecycler.go:498","level":"info","msg":"lifecycler loop() exited gracefully","ring":"ingester","ts":"2023-11-21T16:41:39.8043733Z"}
{"caller":"lifecycler.go:811","level":"info","msg":"changing instance state from","new_state":"LEAVING","old_state":"ACTIVE","ring":"ingester","ts":"2023-11-21T16:41:39.804427334Z"}
{"caller":"ingester.go:2586","level":"info","msg":"starting to flush and ship TSDB blocks","ts":"2023-11-21T16:41:39.804546549Z"}
{"caller":"compact.go:519","duration":"234.25592ms","level":"info","maxt":1700582400000,"mint":1700581137870,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:40.038875302Z","ulid":"01HFSC4H6WJD5XV7H90F0P6D4V"}
{"block":"01HEQEWTXD8ZKSDDDE9071TP70","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.042351899Z"}
{"block":"01HEQY3KTDAJA0TJHPRCZ0MBQN","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.046284574Z"}
{"block":"01HEQHJBED85MPHZTEXSAS9SYD","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.049584673Z"}
{"block":"01HEQEWW1KRNEBKS9K4Y42RVMT","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.052457795Z"}
{"caller":"truncateMemory","duration":"52.163691ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:40.104683711Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:54087","ts":"2023-11-21T16:41:40.10990833Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-store-gateway-1-1a5d9a43 (timeout reached)","ts":"2023-11-21T16:41:40.892540819Z"}
{"caller":"grpc_logging.go:46","duration":"76.461µs","level":"debug","method":"/grpc.health.v1.Health/Check","msg":"gRPC (success)","ts":"2023-11-21T16:41:40.927996371Z"}
{"caller":"compact.go:519","duration":"1.423570173s","level":"info","maxt":1700584899375,"mint":1700582400000,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:41.528432979Z","ulid":"01HFSC4HG89P99GJVSEBSTFP1K"}
{"caller":"truncateMemory","duration":"202.667137ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:41.732417054Z"}
{"caller":"checkpoint.go:100","from_segment":578,"level":"info","mint":1700584899375,"msg":"Creating checkpoint","org_id":"fake","to_segment":579,"ts":"2023-11-21T16:41:41.732951452Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:58933","ts":"2023-11-21T16:41:41.979575777Z"}
{"caller":"head.go:1240","duration":"1.523683363s","first":578,"last":579,"level":"info","msg":"WAL checkpoint complete","org_id":"fake","ts":"2023-11-21T16:41:43.256181134Z"}
{"caller":"ingester.go:2368","compactReason":"forced","level":"debug","msg":"TSDB blocks compaction completed successfully","ts":"2023-11-21T16:41:43.256293661Z","user":"fake"}
{"caller":"shipper.go:334","id":"01HFSC4H6WJD5XV7H90F0P6D4V","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.301936682Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.333067008Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.427698215Z"}
{"caller":"shipper.go:334","id":"01HFSC4HG89P99GJVSEBSTFP1K","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.500269397Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.660061181Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.856623646Z"}
{"caller":"memberlist_logger.go:74","level":"warn","msg":"Was able to connect to cortex-store-gateway-1-1a5d9a43 but other probes failed, network may be misconfigured","ts":"2023-11-21T16:41:43.890882572Z"}
{"caller":"ingester.go:2279","level":"debug","msg":"shipper successfully synchronized TSDB blocks with storage","ts":"2023-11-21T16:41:43.984722874Z","uploaded":2,"user":"fake"}
{"caller":"ingester.go:2595","level":"info","msg":"finished flushing and shipping TSDB blocks","ts":"2023-11-21T16:41:43.984859001Z"}
{"caller":"lifecycler.go:871","final_sleep":"30s","level":"info","msg":"lifecycler entering final sleep before shutdown","ts":"2023-11-21T16:41:43.985246801Z"}
{"caller":"signals.go:55","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2023-11-21T16:41:44.816310571Z"}
{"caller":"module_service.go:96","level":"info","module":"ingester-service","msg":"module stopped","ts":"2023-11-21T16:41:44.816429019Z"}
{"caller":"module_service.go:86","level":"debug","module":"server","msg":"stopping","ts":"2023-11-21T16:41:44.816563052Z"}
{"caller":"module_service.go:109","level":"debug","module":"runtime-config","msg":"module waiting for","ts":"2023-11-21T16:41:44.816598457Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"runtime-config","msg":"stopping","ts":"2023-11-21T16:41:44.816632226Z"}
{"caller":"module_service.go:96","level":"info","module":"runtime-config","msg":"module stopped","ts":"2023-11-21T16:41:44.816643075Z"}
{"caller":"module_service.go:109","level":"debug","module":"memberlist-kv","msg":"module waiting for","ts":"2023-11-21T16:41:44.816657622Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"memberlist-kv","msg":"stopping","ts":"2023-11-21T16:41:44.816672603Z"}
{"caller":"memberlist_client.go:612","level":"info","msg":"leaving memberlist cluster","ts":"2023-11-21T16:41:44.816698917Z"}
{"caller":"module_service.go:96","level":"info","module":"memberlist-kv","msg":"module stopped","ts":"2023-11-21T16:41:45.841625286Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-distributor-7d7d5b59b8-9t7ks-7824768a (timeout reached)","ts":"2023-11-21T16:41:45.89149416Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect cortex-distributor-7d7d5b59b8-9t7ks-7824768a has failed, no acks received","ts":"2023-11-21T16:41:48.891631962Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:49.804559785Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:54.804679488Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:59.805094041Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:04.805275687Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:09.805392347Z"}
{"caller":"lifecycler.go:877","level":"debug","msg":"unregistering instance from ring","ring":"ingester","ts":"2023-11-21T16:42:13.986349184Z"}
{"caller":"ingester.go:772","err":"failed to unregister from the KV store, ring: ingester: unexpected state: Terminated","level":"warn","msg":"failed to stop ingester lifecycler","ts":"2023-11-21T16:42:13.986629129Z"}
{"caller":"logging.go:76","level":"debug","msg":"GET /ingester/shutdown (204) 34.185574054s","traceID":"1a7457db41a31f14","ts":"2023-11-21T16:42:13.989845983Z"}
{"caller":"server_service.go:50","level":"info","msg":"server stopped","ts":"2023-11-21T16:42:14.148840428Z"}
{"caller":"module_service.go:96","level":"info","module":"server","msg":"module stopped","ts":"2023-11-21T16:42:14.148922944Z"}
{"caller":"cortex.go:423","level":"info","msg":"Cortex stopped","ts":"2023-11-21T16:42:14.148952283Z"}
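
One possible reading of the logs above, offered as a sketch rather than a confirmed root cause: the `memberlist-kv` module reports `module stopped` while the lifecycler is still in its 30s final sleep, and every later KV write fails with `unexpected state: Terminated`. The Python snippet below sorts a four-line excerpt copied from those logs and measures the gap between the two events. The helper names (`parse_ts`, `load_events`, `first_ts`) are hypothetical, and the excerpt is not the complete log.

```python
import json
from datetime import datetime, timezone

# Four lines copied verbatim from the logs above (an excerpt, not the full log).
LOG_EXCERPT = r"""
{"caller":"lifecycler.go:871","final_sleep":"30s","level":"info","msg":"lifecycler entering final sleep before shutdown","ts":"2023-11-21T16:41:43.985246801Z"}
{"caller":"module_service.go:96","level":"info","module":"memberlist-kv","msg":"module stopped","ts":"2023-11-21T16:41:45.841625286Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:49.804559785Z"}
{"caller":"ingester.go:772","err":"failed to unregister from the KV store, ring: ingester: unexpected state: Terminated","level":"warn","msg":"failed to stop ingester lifecycler","ts":"2023-11-21T16:42:13.986629129Z"}
"""


def parse_ts(ts):
    """Parse a UTC timestamp whose fraction may have up to nine digits."""
    base, _, frac = ts.rstrip("Z").partition(".")
    micros = int((frac + "000000")[:6])  # pad or truncate to microseconds
    parsed = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)


def load_events(text):
    """Return (timestamp, record) pairs sorted by time."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return sorted(((parse_ts(r["ts"]), r) for r in records), key=lambda e: e[0])


def first_ts(events, predicate):
    """Timestamp of the first record matching predicate, or None."""
    return next((ts for ts, rec in events if predicate(rec)), None)


events = load_events(LOG_EXCERPT)

# When the memberlist KV module finished stopping.
kv_stopped = first_ts(
    events,
    lambda r: r.get("module") == "memberlist-kv" and r["msg"] == "module stopped",
)
# When the first "unexpected state: Terminated" error was logged.
first_failure = first_ts(
    events, lambda r: "unexpected state: Terminated" in r.get("err", "")
)

gap_seconds = (first_failure - kv_stopped).total_seconds()
print(f"memberlist-kv stopped: {kv_stopped.isoformat()}")
print(f"first failed KV write: {first_failure.isoformat()}")
print(f"gap: {gap_seconds:.3f}s")
```

On this excerpt the first failed write comes roughly four seconds after `memberlist-kv` stops, and the final unregister attempt fails with the same error about 28 seconds later. That ordering fits the reading above, but it does not by itself prove where the `Terminated` state originates.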