cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page:https://cortexmetrics.io/

Ruler API HA

emanlodovice opened this issue · comments

Is your feature request related to a problem? Please describe.
Currently the ReplicationFactor for rulers is hard coded to 1.

Loading each rule group onto just 1 ruler is a problem for Rules API availability, as described in #4435.

Right now, the Ruler returns 5XX in the API if there is an outage in at least one ruler instance. I am assuming this is because the rulers fail to return a complete list of rule groups to the caller.

Describe the solution you'd like
If a ruler restarts, it will lose the state of the rule groups it is running. By state I mean information like Alerts, Health, EvaluationDuration, LastError, etc. The state only gets set when the rule group evaluates, which can be minutes after the ruler starts because it depends on the rule group interval.

Rulers don't have evaluation HA, since rule group states are lost after a ruler restart/reshard. We can still have Rules API HA by allowing a higher ReplicationFactor for rulers, where only the first ruler evaluates the rule group and the rest of the replicas just load the rule group so that they have the rule group information to respond to API calls. This means that if we have ReplicationFactor set to 3, 3 rulers will load the rule group but only 1 will evaluate it.

On the API handler, we return both the rule groups that the ruler is evaluating and the rule groups that the ruler loads but does NOT evaluate, then de-duplicate the resulting list by selecting the rule group information with the latest LastEvaluation value. This way, the rule group information coming from the ruler evaluating the rule group will always be selected. If that ruler has an outage, we can still return the rule groups that it was evaluating because they are loaded by other rulers, but with a blank state.
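To make the de-duplication step concrete, here is a minimal Python sketch. It assumes a rule group is identified by (user, namespace, name) and that a backup copy with blank state has no LastEvaluation. The `RuleGroupState` type and its field names are hypothetical and are not the actual Cortex types.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class RuleGroupState:
    # Hypothetical, simplified stand-in for the state a ruler returns.
    user: str
    namespace: str
    name: str
    last_evaluation: Optional[datetime]  # None models a backup copy with blank state
    health: str = "unknown"


# Sentinel older than any real evaluation, so blank-state copies always lose.
_NEVER = datetime.min.replace(tzinfo=timezone.utc)


def remove_duplicates(groups: Iterable[RuleGroupState]) -> list[RuleGroupState]:
    """Keep one entry per rule group, preferring the latest last_evaluation."""
    latest: dict[tuple[str, str, str], RuleGroupState] = {}
    for group in groups:
        key = (group.user, group.namespace, group.name)
        current = latest.get(key)
        if current is None or (group.last_evaluation or _NEVER) > (
            current.last_evaluation or _NEVER
        ):
            latest[key] = group
    return list(latest.values())
```

If only blank-state copies of a group are returned (the evaluating ruler is down), one of them is kept, which matches the "still listed, but with a blank state" behaviour described above.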

Sample pseudo code of the idea assuming replication factor set to 3 with AZ awareness enabled:

rule_groups_to_evaluation = []
rule_groups_to_backup = []
for rule_group in rule_groups_from_s3:
    hash = tokenForGroup(rule_group)
    rulers = ring.Get(hash, RingOp)
    if rulers[0].Addr == curInstanceAddr:
        rule_groups_to_evaluation.add(rule_group)
    else if rulers[1].Addr == curInstanceAddr || rulers[2].Addr == curInstanceAddr:
        rule_groups_to_backup.add(rule_group)

function GetRules() {
    // getLocalRules currently exists in cortex and it returns the rules that
    // are being evaluated
    rule_groups = getLocalRules()
    for rule_group in rule_groups_to_backup:
        rule_groups.add(rule_group)
    return rule_groups
}


function ListRules() {
    rulers = ring.GetReplicationSet()
    rule_groups = []
    failure_az = set()
    last_err = nil
    for ruler in rulers:
        client = clientPool.GetClientFor(ruler.Addr)
        states, err = client.GetRules()
        if err != nil:
            failure_az.add(ruler.AZ)
            last_err = err
        else:
            rule_groups.join(states)
    // tolerate failures in at most one AZ
    if len(failure_az) > 1:
        return last_err
    remove_duplicates(rule_groups)
    return rule_groups
}
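
The fan-out in ListRules can also be written as a small runnable Python sketch. This is only an illustration of the "tolerate failures in at most one AZ" rule from the pseudocode. `Ruler`, `list_rules`, the `get_rules` callback and the use of `ConnectionError` are all hypothetical stand-ins, not Cortex APIs, and de-duplication is left out.

```python
from typing import Callable, Iterable, NamedTuple


class Ruler(NamedTuple):
    # Hypothetical minimal view of a ring instance.
    addr: str
    az: str


def list_rules(
    rulers: Iterable[Ruler],
    get_rules: Callable[[str], list],
    max_failed_azs: int = 1,
) -> list:
    """Collect rule groups from every ruler, tolerating failures in a few AZs."""
    groups: list = []
    failed_azs: set[str] = set()
    last_err = None
    for ruler in rulers:
        try:
            groups.extend(get_rules(ruler.addr))
        except ConnectionError as err:
            # Remember which zone failed; several failures in one AZ count once.
            failed_azs.add(ruler.az)
            last_err = err
    if len(failed_azs) > max_failed_azs:
        raise RuntimeError(
            f"rulers unavailable in {len(failed_azs)} zones"
        ) from last_err
    return groups
```

With a replication factor of 3 and one replica per AZ, losing a single AZ still leaves two copies of every rule group, which is why one failed zone is tolerated and two are not.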

Describe alternatives you've considered
An alternative solution that was considered was to store the state of the rule groups in persistent storage such as an SQL database. The rulers would write the state to this database on every rule evaluation. The Rules API could then just read from this database instead of doing a fan-out request to all rulers in the ring.

But the unpredictable nature of alerts in alerting rules could result in a huge amount of data being written to the database, which could negatively affect performance. Also, adding a database is a huge commitment that could become problematic in the future when we have to adjust our data formats.

@emanlodovice

If we restart the first ruler, we can query the API but do not get any recording rules or evaluation. Most of the work a ruler does is the evaluation, so we are going to have lots of unused resources in those extra rulers.

I don't like the database approach. That approach is deprecated for many reasons already, so that is fine.

Feels to me, if we had HA rulers, we could already fix this problem.
https://cortexmetrics.io/docs/proposals/ruler-ha/ was the proposal to fix evaluations and API issues

This suggestion tries to fix only the API problem, which is fine if it brings us one step closer to full ruler HA. Is that how we are approaching this?

On the other hand, if the design for ruler HA was wrong, we should update that spec. I think that would be the best way to approach this.

@friedrichg

Thank you for checking this.

> This suggestion tries to fix only the API problem, which is fine if it brings us one step closer to full ruler HA. Is that how we are approaching this?

Yes, for now we do the API first because HA on evaluation is proving to be challenging. But implementing evaluation HA would require allowing a higher replication factor and deduplicating the API response, both of which will be tackled by this API HA proposal.

> If we restart the first ruler, we can query the API but do not get any recording rules or evaluation. Most of the work a ruler does is the evaluation, so we are going to have lots of unused resources in those extra rulers.

No, the other rulers will still evaluate rule groups. I think the ring operation always returns its list in a consistent order for a given rule group. So the ring op for RG1 can give [ruler1, ruler2, ruler3] and the ring op for RG2 can give [ruler2, ruler1, ruler3]. In this case ruler1 will evaluate RG1 and ruler2 will evaluate RG2. But ruler1, ruler2, and ruler3 will each have a copy of the RG1 and RG2 configuration from s3 so that they can return it when there is an API request.
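
To illustrate that point, here is a toy Python sketch of how rule groups could be split into "evaluate" and "backup" sets per ruler. It assumes a simplified ring with one token per ruler and no zone awareness, which is not how the real Cortex ring works. `token_for`, `replicas_for` and `assign` are hypothetical names made up for this example.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    # Stable 32-bit token derived from the key (toy stand-in for a ring token).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def replicas_for(group: str, rulers: list[str], replication_factor: int) -> list[str]:
    """Walk the toy ring clockwise from the group's token; first entry is the owner."""
    ring = sorted(rulers, key=token_for)
    tokens = [token_for(r) for r in ring]
    start = bisect.bisect_left(tokens, token_for(group)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


def assign(groups: list[str], rulers: list[str], replication_factor: int):
    """Return (evaluating, backup) maps from ruler to the rule groups it holds."""
    evaluating = {r: [] for r in rulers}
    backup = {r: [] for r in rulers}
    for group in groups:
        owner, *others = replicas_for(group, rulers, replication_factor)
        evaluating[owner].append(group)  # only the first replica evaluates
        for ruler in others:
            backup[ruler].append(group)  # the rest only load the group for the API
    return evaluating, backup
```

Every rule group ends up in exactly one ruler's evaluation list, and different groups can have different owners, so the extra replicas are not idle: each ruler evaluates the groups it owns and only holds backups for the groups owned by others.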