Better mechanism to detect impact in terms of the number of rule groups when rulers become unhealthy.

Question

emanlodovice opened this issue 2 months ago · comments

Emmanuel Lodovice commented 2 months ago

Is your feature request related to a problem? Please describe.
Currently, one way to count the rule groups for a given tenant is to count the unique rule_group labels on any of the per rule group metrics, such as cortex_prometheus_rule_group_rules. This gives an accurate count of rule groups per tenant when all rulers are up and running. When rulers become unhealthy, however, we no longer get metrics from them, so the count of unique rule_group labels is no longer accurate. Because no metric exposes the exact count of rule groups per tenant in the storage, it is very difficult to determine the impact, in terms of the number of affected rule groups, when a ruler becomes unhealthy (or when rulers fail to load specific rule groups, for example during resharding).
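To illustrate the undercount, here is a minimal sketch in Go. It does not use Cortex code: the `Sample` type, the `Ruler` field, and the helper names are hypothetical stand-ins for the series a per rule group metric would expose.

```go
package main

import "fmt"

// Sample is a simplified, hypothetical stand-in for one series of a
// per rule group metric such as cortex_prometheus_rule_group_rules.
type Sample struct {
	Ruler     string // assumed: the ruler instance that exposed the series
	Tenant    string
	RuleGroup string
}

// UniqueRuleGroups counts the distinct rule_group values seen per tenant.
func UniqueRuleGroups(samples []Sample) map[string]int {
	seen := make(map[string]map[string]struct{})
	for _, s := range samples {
		if seen[s.Tenant] == nil {
			seen[s.Tenant] = make(map[string]struct{})
		}
		seen[s.Tenant][s.RuleGroup] = struct{}{}
	}
	counts := make(map[string]int, len(seen))
	for tenant, groups := range seen {
		counts[tenant] = len(groups)
	}
	return counts
}

// DropRuler removes every series exposed by one ruler, which simulates
// that ruler becoming unhealthy and no longer being scraped.
func DropRuler(samples []Sample, ruler string) []Sample {
	var kept []Sample
	for _, s := range samples {
		if s.Ruler != ruler {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	all := []Sample{
		{"ruler-1", "tenant-a", "group-1"},
		{"ruler-1", "tenant-a", "group-2"},
		{"ruler-2", "tenant-a", "group-3"},
	}
	// All rulers healthy: the label count matches the real number of groups.
	fmt.Println(UniqueRuleGroups(all)["tenant-a"]) // 3
	// ruler-2 unhealthy: group-3 silently disappears from the count.
	fmt.Println(UniqueRuleGroups(DropRuler(all, "ruler-2"))["tenant-a"]) // 2
}
```

With ruler-2 gone, the count drops from 3 to 2, and nothing in the remaining metrics indicates that the true number is still 3.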

Describe the solution you'd like
Create a new metric for the count of rule groups per tenant in the storage. Every ruler in a tenant's sub-ring can emit this metric for that tenant, so we don't lose the metric when some rulers go down. The count of rule groups per tenant is already available during the sync rules operation: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L688
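A rough sketch of how the proposed count could be used, written in plain Go rather than against the Cortex code base. The input map (tenant to rule group names) is an assumed shape for what a sync would have listed from storage, and both function names are hypothetical.

```go
package main

import "fmt"

// StoredRuleGroupCounts returns the number of rule groups per tenant.
// groupsByTenant is an assumed shape for the rule groups listed from
// storage during a sync; it is not the actual Cortex type.
func StoredRuleGroupCounts(groupsByTenant map[string][]string) map[string]int {
	counts := make(map[string]int, len(groupsByTenant))
	for tenant, groups := range groupsByTenant {
		counts[tenant] = len(groups)
	}
	return counts
}

// MissingRuleGroups compares the stored count with the count observed from
// per rule group metrics and returns, per tenant, how many groups are
// unaccounted for. Tenants with nothing missing are omitted.
func MissingRuleGroups(stored, observed map[string]int) map[string]int {
	missing := make(map[string]int)
	for tenant, want := range stored {
		if diff := want - observed[tenant]; diff > 0 {
			missing[tenant] = diff
		}
	}
	return missing
}

func main() {
	stored := StoredRuleGroupCounts(map[string][]string{
		"tenant-a": {"group-1", "group-2", "group-3"},
		"tenant-b": {"group-1"},
	})
	// Observed counts, e.g. unique rule_group labels from healthy rulers.
	observed := map[string]int{"tenant-a": 2, "tenant-b": 1}
	fmt.Println(MissingRuleGroups(stored, observed)) // map[tenant-a:1]
}
```

In a real ruler the stored count would presumably be exported as a gauge labelled by tenant, and the subtraction would happen in a query or alert rather than in Go; the sketch only shows the arithmetic that such a metric makes possible.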

Emmanuel Lodovice · Answer 1 · Fri Apr 19 2024 07:07:00 GMT+0800 (China Standard Time)

I can work on this.