"Unable to open TSDB" errors when rolling or scaling up Ingesters
alexinfoblox opened this issue
Describe the bug
We are running Cortex 1.15 in an AWS EKS cluster.
Some time ago the ingesters started hitting the TSDB error shown in the logs below. An affected ingester fails to start and goes into CrashLoopBackOff. It does not happen on every restart, but it is frequent, and each occurrence needs manual intervention.
Only removing the PVC helps.
This can happen with any ingester restart:
- when we do a rolling upgrade of ingesters
- when the ingester is scaled up by HPA
This was a known issue in Prometheus that was fixed.
Ingester logs:
level=error ts=2023-10-27T06:07:17.783133605Z caller=ingester.go:2081 msg="unable to open TSDB" err="failed to open TSDB: /data/ca-3: found unsequential head chunk files /data/ca-3/chunks_head/000245 (index: 245) and /data/ca-3/chunks_head/000419 (index: 419)" user=ca-3
level=error ts=2024-02-12T11:04:44.307677868Z caller=cortex.go:434 msg="module failed" module=ingester-service err="invalid service state: Failed, expected: Running, failure: opening existing TSDBs: unable to open TSDB for user cp-2: failed to open TSDB: /data/cp-2: found unsequential head chunk files /data/cp-2/chunks_head/001799 (index: 1799) and /data/cp-2/chunks_head/001818 (index: 1818)"
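To inspect a volume before deleting the PVC, a check along the following lines may help. It is a minimal sketch, not Cortex or Prometheus code. It assumes, based only on the error messages above, that the files in `chunks_head` are named with zero-padded decimal indices and that the TSDB expects consecutive indices to differ by exactly 1. The function names `find_sequence_gaps` and `check_chunks_head` are hypothetical.

```python
import re
from pathlib import Path


def find_sequence_gaps(names):
    """Return (previous, next) index pairs that are not consecutive.

    Assumption: head chunk file names are purely numeric (e.g. "000245"),
    as seen in the error messages. Non-numeric names are ignored.
    """
    indices = sorted(int(n) for n in names if re.fullmatch(r"\d+", n))
    return [(a, b) for a, b in zip(indices, indices[1:]) if b != a + 1]


def check_chunks_head(tenant_dir):
    """List the gaps in <tenant_dir>/chunks_head without modifying anything."""
    chunks_dir = Path(tenant_dir) / "chunks_head"
    return find_sequence_gaps(p.name for p in chunks_dir.iterdir() if p.is_file())
```

For example, `check_chunks_head("/data/ca-3")` on the volume from the first log line would be expected to report the pair `(245, 419)`. The check is read-only and only tells you which tenant directories are affected; it does not repair anything.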
TSDB config:
tsdb:
  dir: /data
  block_ranges_period:
    - 2h0m0s
  retention_period: 6h0m0s
  ship_interval: 1m0s
  ship_concurrency: 10
  head_compaction_interval: 1m0s
  head_compaction_concurrency: 5
  head_compaction_idle_timeout: 1h0m0s
  head_chunks_write_buffer_size_bytes: 4194304
  stripe_size: 16384
  wal_compression_enabled: true
  wal_segment_size_bytes: 134217728
  flush_blocks_on_shutdown: false
  close_idle_tsdb_timeout: 0s
  head_chunks_write_queue_size: 0
  max_tsdb_opening_concurrency_on_startup: 10
  max_exemplars: 0
  memory_snapshot_on_shutdown: false
  out_of_order_cap_max: 32
Is there a way to reproduce the issue on your end?
I think it might still be an issue on the Prometheus TSDB side.
@yeya24, I found the root cause of the issue. It is not related to Prometheus TSDB or to any Cortex bug; it was caused by our own manual actions during the rolling update process. The errors now seem to be gone.
Closing the ticket.
Thank you