gogama / incite

Hassle-free queries on Amazon CloudWatch Logs Insights in Go

Retry failed queries

vcschapp opened this issue

User Story

As an application writer, I want my application to be resilient to transient failures that Insights does not protect me from.

Background

As of 2021-10-07, an Insights query which is accepted, syntactically valid, and running against valid log groups may simply "fail" for unexplained reasons. When this happens, the CloudWatch Logs Insights service puts the query into status "Failed" and the QueryManager will log:

2021/10/07 23:46:58 incite: QueryManager(0xc00063ba00) unexpected terminal status chunk "[some query here]" [2021-10-02 17:00:00 +0000 UTC..2021-10-02 17:15:00 +0000 UTC): Failed

The stream will then return an error to the same effect.
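For illustration, here is roughly what a caller experiences today. This is a minimal sketch, assuming the aws-sdk-go v1 client wiring incite was built against; the log group name and query text are placeholders.

package main

import (
        "log"
        "time"

        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/cloudwatchlogs"
        "github.com/gogama/incite"
)

func main() {
        m := incite.NewQueryManager(incite.Config{
                Actions: cloudwatchlogs.New(session.Must(session.NewSession())),
        })
        defer m.Close()

        end := time.Now().Truncate(time.Minute)
        s, err := m.Query(incite.QuerySpec{
                Text:   "fields @timestamp, @message | limit 10",
                Groups: []string{"/some/log/group"},
                Start:  end.Add(-15 * time.Minute),
                End:    end,
        })
        if err != nil {
                log.Fatal(err) // query rejected up front
        }
        if _, err := incite.ReadAll(s); err != nil {
                // Without retry, a transient "Failed" chunk surfaces here as
                // a TerminalQueryStatusError and the whole query is lost.
                log.Fatal(err)
        }
}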

The customer should be protected from these transient failures to some degree.

Task List

  1. Add a configuration field in Config that specifies a finite number of times a chunk may be retried.
    • Could be a fixed number per chunk.
    • Could also be a throttled global number per QueryManager, to prevent retry storms during big outage-type events.
  2. Implement the code in QueryManager.

Some related starting point questions:

  1. Should retry be on by default?
  2. How can we extend Config without breaking backward compatibility?

IMO the answer to the first question is yes. You should get some basic sensible resiliency without having to specify it explicitly.

The answer to the second question is to add a new field that's either a pointer or an interface, and replace the nil value with the default value on construction of mgr.
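A minimal sketch of that pattern, with hypothetical names rather than the actual incite internals:

const DefaultRetry = 2

type Config struct {
        // Retry is nil to request the default behavior, which preserves
        // backward compatibility for callers who zero-initialize Config.
        Retry *int
}

type mgr struct {
        retry int
}

// newMgr replaces a nil Retry with the default at construction time.
func newMgr(cfg Config) *mgr {
        retry := DefaultRetry
        if cfg.Retry != nil {
                retry = *cfg.Retry
        }
        return &mgr{retry: retry}
}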

The next set of questions relates to how much retrying we potentially want to do.

  1. If we retry too much and CWL Insights is having a bad day, things will take a very long time.
  2. However, this probably isn't a major consideration, since being able to depend on accurate results is probably more important than getting inaccurate results faster in a bad outage scenario.
  3. I think we want to optimize for good behavior when CWL Insights is having a "typical" day, which means somewhere between 0.1% and 2% of queries fail.
  4. Also, if CWL Insights is having a horrible day and being very slow, the user can always close the QM and bring everything to a screeching halt.
  5. Do we want to have a kind of global anti-retry throttle (global in the QM) which starts blocking retries if the overall failure rate is high?
    • Meaning, if say 25% of queries are failing, such that we detect we've had 50 retries in the last 10 seconds or something, do we open the circuit breaker and temporarily just stop retrying?
    • I think for now this is too complicated and there isn't enough evidence to support it. We know that failed query chunks are pretty rare on a typical day and the cost of redoing 1% of them isn't going to be that significant.
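For concreteness, the kind of windowed circuit breaker being ruled out here might look like the sketch below. This is purely illustrative, not proposed code; the names are invented and the 50-retries-per-10-seconds numbers come straight from the hypothetical above.

import (
        "sync"
        "time"
)

// retryThrottle is the rejected idea: a QM-global breaker that stops
// allowing retries when too many have happened in a recent window.
type retryThrottle struct {
        mu     sync.Mutex
        window time.Duration // e.g. 10 * time.Second
        max    int           // e.g. 50 retries per window
        times  []time.Time   // timestamps of recent allowed retries
}

// allow reports whether another retry may proceed, first discarding
// timestamps that have aged out of the window.
func (t *retryThrottle) allow(now time.Time) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        cutoff := now.Add(-t.window)
        i := 0
        for i < len(t.times) && t.times[i].Before(cutoff) {
                i++
        }
        t.times = t.times[i:]
        if len(t.times) >= t.max {
                return false // breaker open: temporarily stop retrying
        }
        t.times = append(t.times, now)
        return true
}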

Based on the above, I would go with something quite simple along these lines:

const (
        // DefaultRetry specifies the default number of retry attempts for
        // query chunks which fail in the CloudWatch Logs Insights service,
        // due to a transient issue where Insights sets the chunk status to
        // "Failed".
        DefaultRetry = 2

        // MaxRetry specifies the maximum number of retry attempts for
        // failed query chunks.
        MaxRetry = 5
)

type Config struct {
        ...

        // Retry specifies how many times the QueryManager should re-submit a
        // query chunk which suffered a transient failure in the CloudWatch Logs
        // Insights service before failing the query to which the chunk belongs
        // with a TerminalQueryStatusError.
        //
        // If Retry is nil, the value DefaultRetry is used as the retry count. Simply
        // leaving this field nil should give good results in most applications.
        //
        // Retry must be either nil or a pointer to a non-negative integer less
        // than or equal to MaxRetry. To disable retries, set it to point to a zero
        // value.
        Retry *int
}
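Under this proposal, disabling retries would have looked something like the following sketch. (Hedged: the Actions wiring is assumed, and this Retry field was ultimately never shipped, as the rest of the thread explains.)

noRetry := 0
m := incite.NewQueryManager(incite.Config{
        Actions: cloudwatchlogs.New(session.Must(session.NewSession())),
        Retry:   &noRetry, // pointer-to-zero disables retry entirely
})
defer m.Close()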

Nuance: What to do with retries for Preview mode queries?

It should work fairly well some of the time: the chunk's @ptr table should help dedupe many of the redundant results a retry brings in. But you have to assume there are cases where the retry returns a substantially different result set than the initial attempt and basically causes chaos.
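The dedupe that the @ptr table enables looks roughly like this sketch, assuming incite's Result/ResultField shape; ptrOf and dedupe are hypothetical helpers, not incite API or the actual chunk bookkeeping.

// ptrOf returns the @ptr value Insights attaches to each matched log
// record (hypothetical helper).
func ptrOf(r incite.Result) string {
        for _, f := range r {
                if f.Field == "@ptr" {
                        return f.Value
                }
        }
        return ""
}

// dedupe drops rows already seen in an earlier preview batch, which is
// why a retried chunk often merges cleanly.
func dedupe(seen map[string]bool, batch []incite.Result) []incite.Result {
        var out []incite.Result
        for _, r := range batch {
                if p := ptrOf(r); p == "" || !seen[p] {
                        seen[p] = true
                        out = append(out, r)
                }
        }
        return out
}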

I'm inclined to say that Preview is incompatible with Retry.

But then: should Retry live on the stream rather than on the QM? Otherwise you get into weird cases where setting Preview in the QuerySpec would silently override the retry behavior configured in the QM. So yes, let's move it to the stream.

This is silly.

Why should anyone have to configure anything, anyway? The QM already does infinite retry in the case of 500s; why should this be different?

New plan: no configuration, and infinite retry of failed queries is always on as long as you're not in Preview mode.
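In sketch form (hypothetical names, not the actual code from the commit mentioned below), the decision reduces to:

// shouldRetryChunk captures the final policy: a chunk whose status comes
// back "Failed" is always re-submitted unless the stream is in Preview
// mode; no counter, no Config knob.
func shouldRetryChunk(status string, preview bool) bool {
        return status == "Failed" && !preview
}

// Inside the QM poll loop, roughly:
//
//      if shouldRetryChunk(chunk.status, chunk.preview) {
//              restartChunk(chunk) // re-runs StartQuery for the chunk's time range
//      } else if chunk.status != "Complete" {
//              failStream(chunk, chunk.status) // TerminalQueryStatusError
//      }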

Implemented it in commit f577360a as always-on infinite retry. If future events prove we need to have a max counter (no API change) or expose a max counter (API change) we can do that then.

This change should also make chunk splitting easier.

Will be part of the v1.1.0 release, coming soon.