"Unable to open TSDB" errors when rolling or scaling up Ingesters

Question

alexinfoblox opened this issue 2 months ago · comments

Aliaksandr Serabkou commented 2 months ago

Describe the bug

We run Cortex 1.15 in an AWS EKS cluster.
Some time ago the ingesters started failing with a TSDB error on startup. The affected ingester cannot start and ends up in CrashLoopBackOff. It does not happen on every restart, but it is frequent, and each occurrence needs manual intervention.
The only fix we have found is to delete the ingester's PVC.
This can happen with any ingester restart:

when we do a rolling upgrade of ingesters
when the ingester is scaled up by HPA

A similar problem was a known issue in Prometheus and was reported as fixed there.
Ingester logs:

level=error ts=2023-10-27T06:07:17.783133605Z caller=ingester.go:2081 msg="unable to open TSDB" err="failed to open TSDB: /data/ca-3: found unsequential head chunk files /data/ca-3/chunks_head/000245 (index: 245) and /data/ca-3/chunks_head/000419 (index: 419)" user=ca-3

level=error ts=2024-02-12T11:04:44.307677868Z caller=cortex.go:434 msg="module failed" module=ingester-service err="invalid service state: Failed, expected: Running, failure: opening existing TSDBs: unable to open TSDB for user cp-2: failed to open TSDB: /data/cp-2: found unsequential head chunk files /data/cp-2/chunks_head/001799 (index: 1799) and /data/cp-2/chunks_head/001818 (index: 1818)"
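
The two log lines above both report a gap in the numbering of the files under `/data/<tenant>/chunks_head` (245 followed by 419, 1799 followed by 1818). As a rough diagnostic sketch, not the check Prometheus itself runs, the following lists such gaps for one tenant directory before the ingester is started. The function name `find_head_chunk_gaps` is made up for illustration, and the sketch assumes that head chunk files are named purely with decimal digits, as the paths in the logs suggest.

```python
import os
import re
from typing import List, Tuple

# Assumption: head chunk files are named with decimal digits only (e.g. 000245),
# as seen in the paths from the ingester logs. Anything else is ignored.
_SEQ_NAME = re.compile(r"\d+")


def find_head_chunk_gaps(chunks_head_dir: str) -> List[Tuple[int, int]]:
    """Return (previous, next) index pairs where the file numbering is not contiguous."""
    indices = sorted(
        int(name)
        for name in os.listdir(chunks_head_dir)
        if _SEQ_NAME.fullmatch(name)
    )
    # A gap exists wherever the next index is not exactly previous + 1.
    return [(a, b) for a, b in zip(indices, indices[1:]) if b != a + 1]
```

For example, `find_head_chunk_gaps("/data/ca-3/chunks_head")` run against the volume from the first log line would be expected to include the pair `(245, 419)`. An empty list means the numbering is contiguous.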

TSDB config:

tsdb:
    dir: /data
    block_ranges_period:
    - 2h0m0s
    retention_period: 6h0m0s
    ship_interval: 1m0s
    ship_concurrency: 10
    head_compaction_interval: 1m0s
    head_compaction_concurrency: 5
    head_compaction_idle_timeout: 1h0m0s
    head_chunks_write_buffer_size_bytes: 4194304
    stripe_size: 16384
    wal_compression_enabled: true
    wal_segment_size_bytes: 134217728
    flush_blocks_on_shutdown: false
    close_idle_tsdb_timeout: 0s
    head_chunks_write_queue_size: 0
    max_tsdb_opening_concurrency_on_startup: 10
    max_exemplars: 0
    memory_snapshot_on_shutdown: false
    out_of_order_cap_max: 32

Ben Ye · Answer 1 · Sat Apr 06 2024 02:22:38 GMT+0800 (China Standard Time)

Is there a way to reproduce the issue on your end?
I think it might still be an issue on the Prometheus TSDB side.

Aliaksandr Serabkou · Answer 2 · Tue Apr 16 2024 17:17:16 GMT+0800 (China Standard Time)

@yeya24, I found the root cause of the issue. It is not related to Prometheus TSDB or to any Cortex bug. It was caused by manual actions on our side during the rolling update process. The issue now seems to be gone.
Closing the ticket.
Thank you.