Slow running rules from one tenant can cause PrometheusRules API to timeout for all tenants

emanlodovice opened this issue 4 months ago · comments

Emmanuel Lodovice commented 4 months ago

Describe the bug
Currently, the manager's SyncRuleGroups and GetRules methods share the same lock. If SyncRuleGroups becomes slow, GetRules has to wait a long time to acquire the lock.

SyncRuleGroups can become slow when it updates a rule group that contains slow-running rules, because the rule group waits for the running rule to finish before it stops.

https://github.com/prometheus/prometheus/blob/main/rules/group.go#L249
https://github.com/prometheus/prometheus/blob/main/rules/group.go#L426-L430

Additional Context

Emmanuel Lodovice · Answer 1 · Thu Jan 25 2024 06:53:20 GMT+0800 (China Standard Time)

Maybe we can snapshot the tenant's RuleGroups before updating the manager, and have GetRules read from the snapshot while SyncRuleGroups is running.

Ben Ye · Answer 2 · Thu Jan 25 2024 09:22:42 GMT+0800 (China Standard Time)

My 2 cents:

It is definitely something we need to fix. GetRules shouldn't be impacted by that user manager lock.
I think it is fine to read from the snapshot as you mentioned. We might not have the up-to-date rule groups at each ruler, but that is OK since we rely on eventual consistency. And we have the global rules merge, too.

Emmanuel Lodovice · Answer 3 · Thu Jan 25 2024 10:35:13 GMT+0800 (China Standard Time)

Thanks @yeya24, I will try to create a PR to address this issue.