Add retries

rucek opened this issue a year ago · comments

Jacek Kunicki commented a year ago

Retries

Background

We'd like to have a retry mechanism supporting:

various retry schedules:
- direct retry - retrying up to a given number of times with no delay between subsequent retries,
- delayed retry - retrying up to a given number of times with a fixed delay between subsequent retries,
- retry with backoff - retrying up to a given number of times with an increasing delay between subsequent retries; this can optionally include a jitter, i.e. a random factor in the delay between subsequent attempts (see: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
various definitions of the logic to retry:
- direct, i.e. f: => T,
- Either, i.e. f: => Either[E, T]
- Try, i.e. f: => Try[T]
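
To make the schedules above more concrete, here is a minimal sketch of how the delay before each retry could be computed, using the jitter formulas from the AWS article linked above. The `nextDelay` function and its `cap`, `previous` and `rng` parameters are assumptions made for illustration, not part of the proposal. A direct retry corresponds to a zero delay and a delayed retry to a constant one, so only the backoff case is shown.

```scala
import scala.concurrent.duration.*
import scala.util.Random

// Jitter variants named after the AWS article; the enum itself is an illustrative assumption.
enum Jitter:
  case None, Full, Equal, Decorrelated

// Delay to wait before retry number `attempt` (0-based).
// `cap` (upper bound on the delay) and `previous` (the last delay used) are hypothetical parameters.
def nextDelay(
    attempt: Int,
    initial: FiniteDuration,
    cap: FiniteDuration,
    jitter: Jitter,
    previous: FiniteDuration,
    rng: Random = Random
): FiniteDuration =
  val base = initial.toMillis
  // Plain exponential backoff: initial * 2^attempt, limited by the cap.
  val capped = math.min(cap.toMillis.toDouble, base * math.pow(2, attempt)).toLong
  val ms = jitter match
    case Jitter.None  => capped
    // Full jitter: uniformly random between 0 and the capped backoff.
    case Jitter.Full  => rng.between(0L, capped + 1)
    // Equal jitter: half of the backoff is fixed, the other half is random.
    case Jitter.Equal => capped / 2 + rng.between(0L, capped / 2 + 1)
    // Decorrelated jitter: random between the initial delay and 3x the previous delay.
    case Jitter.Decorrelated =>
      val upper = math.max(base, previous.toMillis * 3)
      math.min(cap.toMillis, rng.between(base, upper + 1))
  ms.millis
```

With `Jitter.None`, an initial delay of 100 ms and a cap of 1 second, the delays grow as 100, 200, 400, 800 ms and then stay at the cap.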

Proposed API

Heavily inspired by https://github.com/softwaremill/retry

import ox.retry
import scala.concurrent.duration.*
import scala.util.Try

def foo: Int = ???
def bar: Either[String, Int] = ???
def baz: Try[Int] = ???

// various logic definitions
retry.directly(3)(foo)
retry.directly(3).either(bar)
retry.directly(3).tried(baz) // can't use "try"

// various retry schedules
retry.directly(3)(foo)
retry.delay(3, 100.millis)(foo)
retry.backoff(3, 100.millis)(foo) // optional jitter defaults to Jitter.None

// Either with non-default jitter
retry.backoff(3, 100.millis, Jitter.Full).either(bar) // other jitters are: Equal, Decorrelated
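
One way the builder-style API above could be shaped is sketched below. This is only an illustration under assumptions: the `Retry` class and its internals are hypothetical, the `retry` object here is a local stand-in rather than the actual `ox.retry`, and `backoff` is left out for brevity.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration.*
import scala.util.Try

// Hypothetical stand-in for the proposed entry point; not the actual ox API.
final class Retry(maxRetries: Int, delay: FiniteDuration):
  // Runs `attemptOnce` until it reports success or the retries are used up.
  @tailrec
  private def loop[R](remaining: Int)(attemptOnce: () => R)(isSuccess: R => Boolean): R =
    val result = attemptOnce()
    if isSuccess(result) || remaining == 0 then result
    else
      if delay > Duration.Zero then Thread.sleep(delay.toMillis)
      loop(remaining - 1)(attemptOnce)(isSuccess)

  // Try-based logic: retried while the result is a Failure.
  def tried[T](f: => Try[T]): Try[T] = loop(maxRetries)(() => f)(_.isSuccess)

  // Either-based logic: retried while the result is a Left.
  def either[E, T](f: => Either[E, T]): Either[E, T] = loop(maxRetries)(() => f)(_.isRight)

  // Direct logic: retried while `f` throws; the last exception is rethrown.
  def apply[T](f: => T): T = tried(Try(f)).get

object retry:
  def directly(maxRetries: Int): Retry = Retry(maxRetries, Duration.Zero)
  def delay(maxRetries: Int, delay: FiniteDuration): Retry = Retry(maxRetries, delay)
```

With these definitions, `retry.directly(3)(foo)`, `retry.directly(3).either(bar)` and `retry.delay(3, 100.millis).tried(baz)` all type-check, at the cost of one dedicated method per result type.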

Questions/doubts

wouldn't it be better to pass the retry schedule (directly/delay/backoff) as another parameter to retry, e.g.
```
retry(delay(3, 100.millis), foo) // however, how would we handle Try/Either?
```
is there any way to distinguish between f: => T and f: => Try[T]/f: => Either[E, T] without using dedicated functions like tried/either?

Adam Warski · Answer 1 · Tue Nov 14 2023 00:13:18 GMT+0800 (China Standard Time)

Thanks! That's a good start :)

So first, I think we might consider extending the wish-list a bit. One thing I'm currently missing is a way to decide whether retries should be continued or modified based on the result - this can include both a successful result (e.g. -1 can signal a "retry", same as a Left, Result etc.), as well as an exception. Maybe we could handle different outcomes using different retry strategies? E.g. a connection error to a DB would require a delay, but a transaction serializability failure would require an immediate retry.

As for including the schedule as a parameter to retry - I like this idea :). And that's because maybe we can try to unify retry and repeat using a single API? It's not that different ... one retries the computation if the result is somehow "invalid", the second repeats the computation as long as the result is "valid" (e.g. repeat sending a ping message every second).

Finally, the technical question - distinguishing => T and => Try[T] - this probably can be done using an implicit witness, which would be normally erased. Same trick you can do to create function overloads when the erasure of the types is the same.
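
A minimal sketch of the witness idea, assuming a hypothetical `ResultKind` type class (the names `ResultKind`, `LowPriorityResultKind` and `retryWith` are made up for this example). Instead of overloading, a single method takes a contextual witness that the compiler selects from the static type of the operation, with a low-priority fallback for plain values.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical witness describing when a result of type R counts as successful.
trait ResultKind[R]:
  def isSuccess(result: R): Boolean

trait LowPriorityResultKind:
  // Fallback for a plain `=> T`: any value returned without throwing is a success.
  given plain[R]: ResultKind[R] with
    def isSuccess(result: R): Boolean = true

object ResultKind extends LowPriorityResultKind:
  given tried[T]: ResultKind[Try[T]] with
    def isSuccess(result: Try[T]): Boolean = result.isSuccess
  given either[E, T]: ResultKind[Either[E, T]] with
    def isSuccess(result: Either[E, T]): Boolean = result.isRight

// A single entry point; the witness is picked from the static type of `f`.
def retryWith[R](maxRetries: Int)(f: => R)(using kind: ResultKind[R]): R =
  @tailrec
  def loop(remaining: Int): R =
    Try(f) match
      case Success(r) if kind.isSuccess(r) || remaining == 0 => r
      case Failure(e) if remaining == 0                      => throw e
      case _                                                 => loop(remaining - 1)
  loop(maxRetries)
```

One caveat of this sketch: the witness is chosen from the static type, so an expression typed as `Success[Int]` rather than `Try[Int]` would fall back to the plain instance.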

Krzysztof Ciesielski · Answer 2 · Tue Nov 14 2023 15:26:21 GMT+0800 (China Standard Time)

There are a few concepts in cats-retry that may be worth adopting:

The notion of operation success (also present in softwaremill/retry) - a function T => Boolean
The notion of retryable failure, called worthRetrying in cats-retry - a function which decides whether a non-successful operation should be retried.
RetryPolicy as data, that can be defined separately, or even composed of smaller policies. In this case, a policy means how we should retry when 1. operation failed and 2. is worth retrying: immediately, after a delay, etc.
Side-effecting operations to run on retry - a custom function that we can use for logging, metrics, etc.

For composing retry policies, cats-retry offers operations like:

join: if either policy wants to give up, the combined policy gives up, for example delay(5.seconds).join(maxRetries(5)), where maxRetries(5) represents retry.directly(5) from your examples. BTW join is IMO not the best name.
meet: gives up only if both policies want to give up. I don't think we want this; I don't see any important use cases other than capping backoff, which can be handled by backoff settings.
followedBy: When one gives up, switch to the next one. Sounds pretty useful.
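
The three combinators above can be sketched with policies as plain data. This is a simplified model loosely based on cats-retry, not its actual API: `Decision`, `RetryPolicy`, `maxRetries` and `delay` are illustrative names, and a policy here only sees the number of retries made so far.

```scala
import scala.concurrent.duration.*

// What a policy decides before each retry.
enum Decision:
  case GiveUp
  case DelayAndRetry(delay: FiniteDuration)

import Decision.*

// A policy maps the number of retries made so far to a decision.
final class RetryPolicy(val decide: Int => Decision):
  // join: retry only if both policies want to retry, waiting for the longer delay.
  def join(other: RetryPolicy): RetryPolicy =
    def combined(n: Int): Decision =
      (decide(n), other.decide(n)) match
        case (DelayAndRetry(a), DelayAndRetry(b)) => DelayAndRetry(a.max(b))
        case _                                    => GiveUp
    RetryPolicy(combined)

  // meet: give up only if both policies give up, waiting for the shorter delay.
  def meet(other: RetryPolicy): RetryPolicy =
    def combined(n: Int): Decision =
      (decide(n), other.decide(n)) match
        case (DelayAndRetry(a), DelayAndRetry(b)) => DelayAndRetry(a.min(b))
        case (GiveUp, d)                          => d
        case (d, GiveUp)                          => d
    RetryPolicy(combined)

  // followedBy: once this policy gives up, defer to the other one.
  def followedBy(other: RetryPolicy): RetryPolicy =
    def combined(n: Int): Decision =
      decide(n) match
        case GiveUp => other.decide(n)
        case retry  => retry
    RetryPolicy(combined)

def maxRetries(max: Int): RetryPolicy =
  RetryPolicy(n => if n < max then DelayAndRetry(Duration.Zero) else GiveUp)

def delay(d: FiniteDuration): RetryPolicy =
  RetryPolicy(_ => DelayAndRetry(d))
```

In this model, `delay(5.seconds).join(maxRetries(5))` waits 5 seconds before each of the first five retries and then gives up.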

retry(op: => T)(policy) - Use the default success condition: for op: => T it's "doesn't throw", for op: => Either[E, T] it's Right(_), for op: => Try[_] it's Success(_). Use the default isRetryable, which means any unsuccessful outcome.
retry(op: => T, wasSuccessful: T => Boolean)(policy): custom success condition, default retry condition (any unsuccessful outcome)
retry(op: => T, wasSuccessful: T => Boolean, isRetryable: Try[T] => Boolean)(policy): custom success condition, custom retry condition - depending on exceptions or returned values.
retry(op: => Either[E, T], isRetryable: Try[E] => Boolean)(policy): default success condition (Right(_)), custom retry condition - depending on exceptions or Left values
etc.
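
As an alternative to separate overloads, the first three variants above could be approximated with default arguments. The sketch below is only an assumption about how that might look: it takes a plain `maxRetries` instead of a `policy` to stay short, and it returns a `Try[T]` so that the final outcome is visible to the caller.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical single entry point with defaults, standing in for the overloads listed above.
def retry[T](
    op: => T,
    // Default success condition: any value returned without throwing.
    wasSuccessful: T => Boolean = (_: T) => true,
    // Default retry condition: every unsuccessful outcome is worth retrying.
    isRetryable: Try[T] => Boolean = (_: Try[T]) => true
)(maxRetries: Int): Try[T] =
  @tailrec
  def loop(remaining: Int): Try[T] =
    val outcome = Try(op)
    val succeeded = outcome.toOption.exists(wasSuccessful)
    // Stop on success, when retries are exhausted, or when the outcome is not retryable.
    // Note that an exhausted run can end with a Success holding an "unsuccessful" value.
    if succeeded || remaining == 0 || !isRetryable(outcome) then outcome
    else loop(remaining - 1)
  loop(maxRetries)
```

For example, `retry(fetch(), wasSuccessful = (n: Int) => n >= 0)(3)` would retry a hypothetical `fetch(): Int` up to three times while it throws or returns a negative number.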

If overloading is too tricky, or turns out to be too difficult to handle for the user, we may consider predefined methods similar to cats-retry, like:

retryOnSomeErrors(op: => Either[E, T], isRetryable: E => Boolean)(policy) // will not retry if op throws
retryOnSomeResultsAndSomeErrors(op: => Either[E, T], wasSuccessful: T => Boolean, isRetryableError: E => Boolean, isRetryableResult: T => Boolean)(policy)
retryOnFailuresAndSomeErrors(op: => Either[E, T], wasSuccessful: T => Boolean, isRetryable: E => Boolean)(policy)
retryOnThrow(op: => T, isRetryable: Throwable => Boolean)(policy)
etc., the names are of course just examples.

Bonus, can be left for implementation in the future:
Adam asked about the possibility to "retry immediately on transaction exceptions and delay on database exceptions".
We can achieve this by adding an "evaluation condition" to policies and a special builder:

def op: T = ...
retryCond(op)(
  delay(5.seconds).whenThrownA[DBConnectionException], // creates a ConditionalThrowableRetryPolicy
  immediately.whenThrown {
    case _: TransactionException => true
  }
)

retryCond is an additional operation which may require repeating all the variants, but it simplifies typechecking of the conditions (they have to match the type of unsuccessful op). This also makes it possible to skip isRetryable, since it can be evaluated by checking for the first fulfilled policy condition.

Similarly for other types of op:

def op: Either[E, T] = ...

retryCond(op)(
  resultPolicies = List( // List[ConditionalRetryPolicy[T]]
    delay(5.seconds).when[T](_.code == 300),
    immediately.when[T](_.code != 300)
  ),
  retryableErrorPolicies = List( // List[ConditionalRetryPolicy[E]]
    immediately.whenMatches[E] {
      case _: MyError1 | _: MyError2 => true
    }
  )
)

Conditional policies would act as if composed with 'join', which is IMO enough.
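
For the throwing variant of retryCond, a rough sketch could look as follows. Everything here is hypothetical: `ConditionalPolicy` and `Delayed` are simplified stand-ins for the policy types mentioned above, and the `maxRetries` parameter is an assumption added so that the loop terminates. The first policy whose condition matches the thrown exception decides the delay, and if none matches, the exception is rethrown without retrying.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration.*
import scala.reflect.ClassTag
import scala.util.{Failure, Success, Try}

// A delay plus the condition under which it applies.
final case class ConditionalPolicy(delay: FiniteDuration, applies: Throwable => Boolean)

// Builder that attaches a condition to a delay.
final case class Delayed(delay: FiniteDuration):
  def whenThrownA[E <: Throwable](using ct: ClassTag[E]): ConditionalPolicy =
    ConditionalPolicy(delay, e => ct.runtimeClass.isInstance(e))
  def whenThrown(pf: PartialFunction[Throwable, Boolean]): ConditionalPolicy =
    ConditionalPolicy(delay, e => pf.lift(e).getOrElse(false))

def delay(d: FiniteDuration): Delayed = Delayed(d)
val immediately: Delayed = Delayed(Duration.Zero)

// `maxRetries` is an assumed parameter; the proposal does not say how retries are bounded here.
def retryCond[T](op: => T, maxRetries: Int = 3)(policies: ConditionalPolicy*): T =
  @tailrec
  def loop(remaining: Int): T =
    Try(op) match
      case Success(value) => value
      case Failure(e) =>
        // The first policy whose condition holds decides how long to wait.
        policies.find(_.applies(e)) match
          case Some(policy) if remaining > 0 =>
            Thread.sleep(policy.delay.toMillis)
            loop(remaining - 1)
          case _ => throw e
  loop(maxRetries)
```

Given exception classes such as `DBConnectionException` and `TransactionException`, the call from the example above, `retryCond(op)(delay(5.seconds).whenThrownA[DBConnectionException], immediately.whenThrown { case _: TransactionException => true })`, type-checks against this sketch.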