cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

Slow-running rules from one tenant can cause the PrometheusRules API to time out for all tenants

emanlodovice opened this issue · comments

Describe the bug
Currently, the manager's SyncRuleGroups and GetRules methods share the same lock. This means that whenever SyncRuleGroups is slow, GetRules has to wait a long time to acquire the lock.

SyncRuleGroups can become slow when it updates a rule group that contains slow-running rules, because the rule group waits for the running rule to finish before it stops.

https://github.com/prometheus/prometheus/blob/main/rules/group.go#L249
https://github.com/prometheus/prometheus/blob/main/rules/group.go#L426-L430
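
To make the contention concrete, here is a minimal sketch of the pattern described above. It is not the Cortex code: `sharedLockManager`, the `stopOld` callback, and the use of `sync.RWMutex` are assumptions made for illustration. The callback stands in for a rule group that only stops once its slow rule has finished.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedLockManager is a hypothetical stand-in for the manager in this issue.
// Both methods take the same lock (assumed here to be an RWMutex).
type sharedLockManager struct {
	mtx    sync.RWMutex
	groups map[string][]string // tenant -> rule group names
}

// SyncRuleGroups holds the write lock for the whole update, including the
// potentially slow step of stopping the old groups (stopOld is hypothetical).
func (m *sharedLockManager) SyncRuleGroups(tenant string, groups []string, stopOld func()) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	stopOld()
	m.groups[tenant] = groups
}

// GetRules needs the read lock, so it waits behind any in-progress sync.
func (m *sharedLockManager) GetRules(tenant string) []string {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	return m.groups[tenant]
}

// readerBlocksDuringSync reports whether a GetRules call for tenant-b is
// still waiting while tenant-a's sync is stuck stopping a slow group.
func readerBlocksDuringSync() bool {
	const wait = 50 * time.Millisecond
	m := &sharedLockManager{groups: map[string][]string{"tenant-b": {"fast"}}}
	started, release := make(chan struct{}), make(chan struct{})
	syncDone := make(chan struct{})
	go func() {
		m.SyncRuleGroups("tenant-a", []string{"slow"}, func() {
			close(started) // the write lock is held from here on
			<-release      // simulate waiting for a slow rule to finish
		})
		close(syncDone)
	}()
	<-started

	got := make(chan []string, 1)
	go func() { got <- m.GetRules("tenant-b") }()

	blocked := false
	select {
	case <-got:
	case <-time.After(wait):
		blocked = true // the reader could not get the lock in time
	}
	close(release)
	<-syncDone
	return blocked
}

func main() {
	fmt.Println("GetRules blocked during sync:", readerBlocksDuringSync())
}
```

In this sketch the read for tenant-b stalls even though only tenant-a has a slow rule, which is the cross-tenant effect reported here.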

Additional Context

Maybe we can snapshot the tenant's RuleGroups before updating the manager and have GetRules read from the snapshot while SyncRuleGroups is running.
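
One way the snapshot idea could look, as a rough sketch rather than a proposed patch: the slow sync work runs under its own lock, and readers only touch a separately guarded snapshot. For simplicity this variant always serves reads from the snapshot, not only while a sync is running. The names `Manager`, `RuleGroup`, `snapMtx`, and the `apply` callback are all hypothetical and do not mirror the actual Cortex types.

```go
package main

import (
	"fmt"
	"sync"
)

// RuleGroup is a hypothetical, simplified description of a rule group.
type RuleGroup struct {
	Tenant string
	Name   string
}

// Manager is a hypothetical multi-tenant manager used only for this sketch.
type Manager struct {
	syncMtx sync.Mutex // serializes the (possibly slow) sync work

	snapMtx  sync.RWMutex           // guards only the snapshot, held briefly
	snapshot map[string][]RuleGroup // last successfully applied groups per tenant
}

func NewManager() *Manager {
	return &Manager{snapshot: map[string][]RuleGroup{}}
}

// SyncRuleGroups applies new groups for a tenant. apply stands in for the
// slow part, such as waiting for a running rule before a group stops.
func (m *Manager) SyncRuleGroups(tenant string, groups []RuleGroup, apply func()) {
	m.syncMtx.Lock()
	defer m.syncMtx.Unlock()

	apply() // slow; readers keep seeing the previous snapshot meanwhile

	copied := append([]RuleGroup(nil), groups...)
	m.snapMtx.Lock()
	m.snapshot[tenant] = copied
	m.snapMtx.Unlock()
}

// GetRules never waits on syncMtx; it returns a copy of the snapshot.
func (m *Manager) GetRules(tenant string) []RuleGroup {
	m.snapMtx.RLock()
	defer m.snapMtx.RUnlock()
	return append([]RuleGroup(nil), m.snapshot[tenant]...)
}

func main() {
	m := NewManager()
	m.SyncRuleGroups("tenant-a", []RuleGroup{{Tenant: "tenant-a", Name: "old"}}, func() {})

	started, release, done := make(chan struct{}), make(chan struct{}), make(chan struct{})
	go func() {
		m.SyncRuleGroups("tenant-a", []RuleGroup{{Tenant: "tenant-a", Name: "new"}}, func() {
			close(started)
			<-release // simulate a slow-running rule
		})
		close(done)
	}()

	<-started
	fmt.Println(m.GetRules("tenant-a")[0].Name) // prints "old": sync still in flight
	close(release)
	<-done
	fmt.Println(m.GetRules("tenant-a")[0].Name) // prints "new"
}
```

The trade-off is that a read during a sync returns slightly stale rule groups instead of blocking.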

My 2 cents:

It is definitely something we need to fix. GetRules shouldn't be impacted by that user manager lock.
I think it is fine to read from the snapshot as you mentioned. We might not have the up-to-date rule groups at each ruler, but that is OK since we are eventually consistent. And we have the global rules merge, too.

Thanks @yeya24, I will try to create a PR to address this issue.