Slow-running rules from one tenant can cause PrometheusRules API to time out for all tenants
emanlodovice opened this issue
Describe the bug
Currently the manager's SyncRuleGroups and GetRules methods share the same lock. This means that if SyncRuleGroups becomes slow, GetRules has to wait a long time to acquire the lock.
SyncRuleGroups can become slow when we are updating a rule group with slow-running rules, because the RuleGroup waits for the in-flight rule evaluation to finish before it stops.
https://github.com/prometheus/prometheus/blob/main/rules/group.go#L249
https://github.com/prometheus/prometheus/blob/main/rules/group.go#L426-L430
Additional Context
Maybe we can snapshot the tenant's RuleGroups before updating the manager and have GetRules read from the snapshot while SyncRuleGroups is running.
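A rough sketch of the snapshot idea, under the same simplifying assumptions (rule groups as plain strings; snapshotManager, stopOld and readDuringSync are hypothetical names, not existing ruler APIs). The slow part of the sync runs without holding the lock that GetRules needs, and reads for a tenant that is mid-sync are served from the snapshot. In this toy model the live map is only replaced at the end of the sync, so the snapshot is mostly illustrative; in a real manager the live state would sit behind the manager that is being updated.

```go
package main

import "sync"

// snapshotManager is a hypothetical manager that keeps a per-tenant snapshot
// of rule groups while that tenant's sync is in progress.
type snapshotManager struct {
	syncMtx   sync.Mutex   // serializes syncs; never taken by GetRules
	mtx       sync.RWMutex // guards the maps below; held only briefly
	groups    map[string][]string
	snapshots map[string][]string // present only while a tenant is syncing
}

func newSnapshotManager() *snapshotManager {
	return &snapshotManager{
		groups:    map[string][]string{},
		snapshots: map[string][]string{},
	}
}

// SyncRuleGroups snapshots the tenant's current groups, runs the slow stop
// step without holding mtx, then swaps in the new groups.
func (m *snapshotManager) SyncRuleGroups(tenant string, groups []string, stopOld func()) {
	m.syncMtx.Lock()
	defer m.syncMtx.Unlock()

	m.mtx.Lock()
	m.snapshots[tenant] = append([]string(nil), m.groups[tenant]...)
	m.mtx.Unlock()

	stopOld() // slow part: GetRules is not blocked while this runs

	m.mtx.Lock()
	m.groups[tenant] = groups
	delete(m.snapshots, tenant)
	m.mtx.Unlock()
}

// GetRules serves from the snapshot while the tenant is syncing, and from
// the live groups otherwise. The result may briefly be stale.
func (m *snapshotManager) GetRules(tenant string) []string {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	if snap, ok := m.snapshots[tenant]; ok {
		return append([]string(nil), snap...)
	}
	return append([]string(nil), m.groups[tenant]...)
}

// readDuringSync calls GetRules while a sync for the same tenant is parked
// in its slow stop step, and returns what the reader saw.
func readDuringSync(m *snapshotManager, tenant string, newGroups []string) []string {
	inSync := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		m.SyncRuleGroups(tenant, newGroups, func() {
			close(inSync)
			<-release // stay "slow" until the reader has finished
		})
	}()
	<-inSync
	got := m.GetRules(tenant) // returns immediately with the snapshot
	close(release)
	<-done
	return got
}
```

The trade-off is that a reader can see the previous rule groups for the duration of the sync, which matches the eventual-consistency behaviour discussed in this thread.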
My 2 cents:
It is definitely something we need to fix. GetRules shouldn't be impacted by that user manager lock.
I think it is fine to read from the snapshot as you mentioned. We might not have up-to-date rule groups at each ruler, but that is OK since we rely on eventual consistency. And we have the global rules merge, too.
Thanks @yeya24, I will try to create a PR to address this issue.