cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

compaction halt when "overlapping sources detected for plan"

AlexandreRoux opened this issue

Describe the bug
The compactor halts compaction when it hits the "overlapping sources detected for plan" error (level=error).

Since the plan is retried indefinitely, no new blocks are compacted, and the only solution is to mark the block as no-compact using thanos tools bucket.

Although we are using the skip_blocks_with_out_of_order_chunks_enabled: true configuration, the block is not being marked as no-compact (possibly because the root cause is something other than out-of-order chunks).

To Reproduce
Unable to reproduce for now; we simply noticed it in our Cortex environment.

ts=2024-03-07T09:35:55.633278609Z caller=compactor.go:712 level=error component=compactor msg="failed to compact user blocks" user=tenant-1 err="compaction: group 0@1434040103434464048: failed to run pre compaction callback for plan: [01HF36TN8MEB08EXJSK528JHNN (min time: 1699833600000, max time: 1699840800000) 01HF3C12751TH3HFAMKADKNQCX (min time: 1699833600000, max time: 1699840800000) 01HF3C149T2NT6YZ3MDCATMJHE (min time: 1699833600000, max time: 1699840800000) 01HF3C2V5DNFJ8KN93JK99XQSH (min time: 1699833600000, max time: 1699840800000) 01HF3C1238ZQAQD4B18GCZDE8A (min time: 1699833600000, max time: 1699840800000)]: overlapping sources detected for plan [01HF36TN8MEB08EXJSK528JHNN (min time: 1699833600000, max time: 1699840800000) 01HF3C12751TH3HFAMKADKNQCX (min time: 1699833600000, max time: 1699840800000) 01HF3C149T2NT6YZ3MDCATMJHE (min time: 1699833600000, max time: 1699840800000) 01HF3C2V5DNFJ8KN93JK99XQSH (min time: 1699833600000, max time: 1699840800000) 01HF3C1238ZQAQD4B18GCZDE8A (min time: 1699833600000, max time: 1699840800000)]"
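The plan in the error above is hard to read because it is printed twice on one line. A minimal Python sketch for pulling the block IDs and time ranges out of such a message is shown here; `parse_plan` and `to_utc` are hypothetical helper names, not part of Cortex or Thanos, and the regex assumes the `ULID (min time: ..., max time: ...)` layout seen in this log.

```python
import re
from datetime import datetime, timezone

# Matches entries such as "01HF... (min time: 1699833600000, max time: 1699840800000)".
# The character class is the Crockford base32 alphabet used by ULIDs.
PLAN_ENTRY = re.compile(
    r"([0-9A-HJKMNP-TV-Z]{26}) \(min time: (\d+), max time: (\d+)\)"
)


def parse_plan(err: str):
    """Return unique (block_id, min_ms, max_ms) tuples in order of first appearance."""
    seen = {}
    for block_id, lo, hi in PLAN_ENTRY.findall(err):
        seen.setdefault(block_id, (block_id, int(lo), int(hi)))
    return list(seen.values())


def to_utc(ms: int) -> str:
    """Convert a millisecond Unix timestamp to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()
```

Applied to the message above, this should list five distinct blocks, all covering the same two-hour window from 2023-11-13T00:00:00 to 2023-11-13T02:00:00 UTC.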

Expected behavior
Unsure what the expected behavior should be, but a skip_blocks_ option should be provided so that compaction can continue.

Environment:

  • Infrastructure: kubernetes
  • Deployment tool: helm
  • Cortex version: cortex:v1.16.0-rc.0

Can you guys help take a look?
Maybe @danielblando @alexqyle

The error is "overlapping sources detected for plan", which cannot be skipped by the skip_blocks_with_out_of_order_chunks_enabled config.

Based on the log, all 5 blocks in this compaction plan have min time: 1699833600000 and max time: 1699840800000. They probably share some common source blocks, in which case they are considered overlapping blocks. Here is the code doing this overlapping check: https://github.com/cortexproject/cortex/blob/master/vendor/github.com/thanos-io/thanos/pkg/compact/compact.go#L817

Could you please check the meta.json of those blocks to validate whether there are common source blocks among them?
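One way to run that check offline is sketched below in Python, assuming the meta.json files of the affected blocks have been downloaded into a local directory with one sub-directory per block (that layout and the helper names `load_metas` and `shared_sources` are assumptions for illustration). It reads the standard `ulid` and `compaction.sources` fields of a TSDB block's meta.json and reports any source listed by more than one block. It is only an approximation of the idea behind the linked check, not the Thanos code itself.

```python
import json
from collections import defaultdict
from pathlib import Path


def shared_sources(metas):
    """Map each source ULID listed by more than one block to those blocks' ULIDs."""
    owners = defaultdict(list)
    for meta in metas:
        for source in meta["compaction"]["sources"]:
            owners[source].append(meta["ulid"])
    return {src: blocks for src, blocks in owners.items() if len(blocks) > 1}


def load_metas(directory):
    """Load every <block>/meta.json under a local directory (hypothetical layout)."""
    paths = sorted(Path(directory).glob("*/meta.json"))
    return [json.loads(p.read_text()) for p in paths]


if __name__ == "__main__":
    # "./blocks" is a placeholder path for wherever the meta.json files were copied.
    for source, blocks in shared_sources(load_metas("./blocks")).items():
        print(f"source {source} appears in blocks: {', '.join(blocks)}")
```

If this prints anything for the blocks named in the plan, those blocks were built from at least one common source block, which matches the situation described above.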

I had this exact problem today. It happened because 2 compactors were running against the same S3 bucket for hours for the same user.
ts=2024-03-14T17:21:42.234667096Z caller=compactor.go:712 level=error component=compactor msg="failed to compact user blocks" user=x err="compaction: group 0@4082378620489593290: failed to run pre compaction callback for plan: [01HRY6GWZ39ZXGPRYR18FV7YQ7 (min time: 1710396000000, max time: 1710403200000) 01HRY538CENVKAQ7G8BA77WJ2X (min time: 1710396000000, max time: 1710403200000)]: overlapping sources detected

Just want to double check. @friedrichg @AlexandreRoux, do you have the out-of-order samples feature enabled?

@yeya24 No, we don't. We also don't use shuffle sharding in compactors yet. (Cortex v1.16.0)
It's literally caused by running 2 compactors for the same user. We did that by mistake.

I think this might happen if out-of-order samples are enabled, because a single block might be compacted twice and uploaded to the bucket.
I wonder if it might be related to the shuffle sharding compactor, where two compactors compact the same block.

https://github.com/cortexproject/cortex/releases/tag/v1.17.0-rc.0
The latest version of Cortex is out and it should fix the problem.