reactor / reactor-addons

Additional optional modules for the Reactor project

Retry not playing well with onErrorContinue

sguillope opened this issue · comments

I'm fairly new to Reactor so my apologies if I'm misunderstanding something.

Versions

  • reactor-core 3.2.9
  • reactor-extra 3.2.3

Reproducible sample

Here's a code sample that reproduces the issue I'm having:

final Logger log = LoggerFactory.getLogger("SAMPLE");
Flux.just(0, 1)
    .flatMap(value ->
        Mono.<Integer>create(sink -> sink.success(1 / value))
            // removing retryWhen operator makes it work as expected
            .retryWhen(Retry.onlyIf(retryContext -> false))
            .onErrorResume(throwable -> Mono.just(100))
    )
    .onErrorContinue((throwable, o) -> log.warn("value ignored {}", o))
    .doOnNext(result -> log.debug("next is {}", result))
    .blockLast();

Expected result

  • onErrorContinue is not called
  • onErrorResume is called with the "fallback" value
  • doOnNext is called
  • the sequence finishes successfully
  • output is:
DEBUG reactor.retry.DefaultRetry - Stopping retries since predicate returned false, retry context: iteration=1 exception=java.lang.ArithmeticException: / by zero backoff={0ms}
DEBUG SAMPLE - next is 100
DEBUG SAMPLE - next is 1

Actual result

  • onErrorContinue is called
  • onErrorResume is not called
  • doOnNext is not called
  • the sequence never finishes (hangs forever)
  • output is:
DEBUG reactor.retry.DefaultRetry - Stopping retries since predicate returned false, retry context: iteration=1 exception=java.lang.ArithmeticException: / by zero backoff={0ms}
WARN SAMPLE - value ignored [0,java.lang.ArithmeticException: / by zero]
DEBUG SAMPLE - next is 1

Context

Here's some additional context.
We have a WebFlux application processing messages from a queue indefinitely. When processing a message, we use the WebClient to send a request to another service. Said service may return a 400 response which we sometimes want to treat as a success. We use WebClient's onStatus to map the response to a custom exception which is then handled downstream through an onErrorResume operator. Moreover, the whole flux sequence uses the onErrorContinue(BiConsumer) operator to ensure it will continue processing other messages from the queue in case of an unexpected error.
Finally the call to the external service is wrapped in a Retry operator using retryWhen(Retry.onlyIf(...).retryMax(...)) which activates only for "retryable" exceptions.

To summarise, we have something like this (code heavily simplified):

queueSource.flatMap(message -> 
    makeWebClientCall(message)
        .onStatus(
            HttpStatus.BAD_REQUEST::equals,
            response -> new ExpectedException()
        )
        .bodyToFlux(SomeResponse.class)
        .retryWhen(
            Retry.onlyIf(retryContext -> 
                retryContext.exception() instanceof RetryableException
            ).retryMax(1)
        )
        .onErrorResume(ExpectedException.class, throwable -> "success")
)
.onErrorContinue((throwable, o) -> log.error("{}", throwable.getMessage(), throwable))
.subscribe();

The problem we're hitting is that when we return an ExpectedException error, it never hits onErrorResume and goes directly into onErrorContinue. If we remove the retry logic it works as expected.

Keep in mind that onErrorContinue is a very special operator that short-circuits the Reactive Streams onError propagation and breaks the contract that any error is a terminating event. It should only be used with care, and only if you deeply understand the consequences. Unlike other operators, it doesn't influence what happens below it, but above it. And not all operators react to it.
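
For illustration (this sketch is not from the thread, and reuses the log variable from the reproducer above), here is the upstream-only effect in isolation: the error raised by map is dropped and handed to the onErrorContinue handler placed below it, while the rest of the sequence keeps going.

// assumes reactor.core.publisher.Flux and the SLF4J log from the sample above
Flux.just(1, 0, 2)
    .map(i -> 10 / i)                                           // throws ArithmeticException for 0
    .onErrorContinue((e, v) -> log.warn("value ignored {}", v)) // the faulty value is dropped, sequence continues
    .subscribe(result -> log.debug("next is {}", result));      // emits 10, then 5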

With that in mind...

Why isn't onErrorResume invoked?

There is error short-circuiting logic put in place by the downstream onErrorContinue that is "visible" to just and flatMap, as well as to the operators inside the flatMap.

Since this logic doesn't discriminate, it happily swallows both ArithmeticException and exceptions produced by a retryWhen exhaustion.

Who actually invokes the onErrorContinue handler?

In the case of the test, it is a concatMap which is part of the Function produced by Retry for retryWhen.

How can I resume from errors inside the flatMap while also shielding myself against unexpected errors in the main sequence?

Either move the onErrorContinue, inhibit it inside the flatMap, or make it peek at errors and rethrow the ones you're prepared to handle explicitly:

Moving the onErrorContinue

Since it acts on its whole upstream sequence, you could put it above the flatMap:

queueSource
    .onErrorContinue((throwable, o) -> log.error("unexpected error in queueSource: {}", throwable.getMessage(), throwable))
    .flatMap(message ->
        makeWebClientCall(message)
            .onStatus(
                HttpStatus.BAD_REQUEST::equals,
                response -> new ExpectedException()
            )
            .bodyToFlux(SomeResponse.class)
            .retryWhen(
                Retry.onlyIf(retryContext ->
                    retryContext.exception() instanceof RetryableException
                ).retryMax(1)
            )
            .onErrorResume(ExpectedException.class, throwable -> "success")
    )
    .subscribe();

This prevents error-termination in the queueSource, for the operators that support it. ⚠️ If that queueSource is NOT made of a combination of Reactor operators but is a custom one, chances are it won't support onErrorContinue. ⚠️

Inhibit the onErrorContinue

An onErrorContinue handler that re-throws the exception basically restores standard behavior. This feature is based on the Context, which propagates upwards but, when attached to the inner Publishers of a flatMap, stays within the boundary of that flatMap. As a result, such a re-throwing handler can be used to inhibit the main onErrorContinue inside the flatMap:

queueSource.flatMap(message -> 
    makeWebClientCall(message)
        .onStatus(
            HttpStatus.BAD_REQUEST::equals,
            response -> new ExpectedException()
        )
        .bodyToFlux(SomeResponse.class)
        .retryWhen(
            Retry.onlyIf(retryContext -> 
                retryContext.exception() instanceof RetryableException
            ).retryMax(1)
        )
        .onErrorResume(ExpectedException.class, throwable -> "success")
        //inhibits onErrorContinue in all operators above WITHIN the flatMap scope:
        .onErrorContinue((t, o) -> { throw Exceptions.propagate(t); })
)
.onErrorContinue((throwable, o) -> log.error("{}", throwable.getMessage(), throwable))
.subscribe();

Alternatively, for a Flux the same can be achieved with .onErrorStop(). This made me notice that this operator is missing in Mono... 🐛 (fixed in reactor/reactor-core#1728)
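
To make that concrete, here is a minimal sketch (adapted from the reproducer above, not code from the original reply) using a Flux inner sequence, since onErrorStop was only available on Flux at the time:

Flux.just(0, 1)
    .flatMap(value ->
        Flux.just(value)
            .map(v -> 1 / v)                             // ArithmeticException for 0
            .onErrorResume(throwable -> Flux.just(100))  // the fallback now runs
            // restores normal error termination for everything above it,
            // so the downstream onErrorContinue is never consulted in here
            .onErrorStop()
    )
    .onErrorContinue((throwable, o) -> log.warn("value ignored {}", o))
    .subscribe(result -> log.debug("next is {}", result)); // emits 100 and 1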

Avoid considering retryable exceptions

This is maybe the least interesting option, because it has the caveat that if the same type of exception happens in the main sequence, that exception will terminate the main sequence:

queueSource.flatMap(message -> 
    makeWebClientCall(message)
        .onStatus(
            HttpStatus.BAD_REQUEST::equals,
            response -> new ExpectedException()
        )
        .bodyToFlux(SomeResponse.class)
        .retryWhen(
            Retry.onlyIf(retryContext -> 
                retryContext.exception() instanceof RetryableException
            ).retryMax(1)
        )
        .onErrorResume(ExpectedException.class, throwable -> "success")
        //inhibits onErrorContinue in all operators above WITHIN the flatMap scope:
        .onErrorContinue((t, o) -> { throw Exceptions.propagate(t); })
)
.onErrorContinue((throwable, o) -> {
    if (throwable instanceof RetryableException) {
        throw Exceptions.propagate(throwable);
    }
    log.error("{}", throwable.getMessage(), throwable);
})
.subscribe();

The other thing to look out for is the fact that Retry will wrap the exception in the case where the number of attempts is exhausted.
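
For instance, if you match on exception type downstream you may need to look at the cause as well. A hedged sketch, assuming reactor-extra's reactor.retry.RetryExhaustedException is the wrapper used once retryMax is exceeded (check the exact type for your versions), and using a placeholder FALLBACK_RESPONSE; it would replace the onErrorResume line from the snippets above:

// unwrap the retry-exhaustion wrapper before matching the "expected" error
.onErrorResume(throwable -> {
    Throwable cause = (throwable instanceof RetryExhaustedException && throwable.getCause() != null)
            ? throwable.getCause()
            : throwable;
    return (cause instanceof ExpectedException)
            ? Mono.just(FALLBACK_RESPONSE) // FALLBACK_RESPONSE is a placeholder fallback value
            : Mono.error(throwable);       // rethrow anything you are not prepared to handle
})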

Thanks so much Simon, really appreciate the time you took to provide such a thorough reply 🥇

I had indeed found that the concatMap in the Retry logic was causing the short-circuit (even in cases where the predicate is not matched).
Like you said, care is required when using onErrorContinue. In complex applications it might not be so obvious that any error happening upstream could be short-circuited (be it in application code or deep in 3rd-party libraries). I think when we put it in, we saw it as a last-resort fallback for any error that wouldn't have been handled by one of the onErrorResume operators upstream. The main reason being that we want to make sure that the stream never ends. Maybe there is a better way to do that without using onErrorContinue?

queueSource is coming from wrapping a Future returned by the AWS SQS client. It roughly corresponds to this:

Mono.fromFuture(() -> sqsAsyncClient.receiveMessage(receiveMessageRequest))
    .subscribeOn(Schedulers.elastic())
    .publishOn(Schedulers.elastic())
    .map(ReceiveMessageResponse::messages)
    .flatMapMany(Flux::fromIterable)
    .repeatWhen(this::delayIfNoItems)
    .retryWhen(this.everyErrorWithBackoff())
    .parallel()
    .runOn(Schedulers.parallel())
    .flatMap(this::process) // this is where makeWebClientCall() would eventually be called
    .sequential()
    .onErrorContinue(
        (throwable, o) ->
            log.error("Error while listening to SQS queue. Continuing", throwable))
    .subscribe(
        aVoid -> log.error("Queue subscription has ended, that shouldn't happen"),
        throwable -> log.error("Error while listening to SQS queue.", throwable));

Moving the onErrorContinue

I don't think moving it up would work for us because we want to make sure that no unhandled error will terminate the stream, be it inside or after the flatMap.

Inhibit the onErrorContinue

This would probably work for our scenario. The only thing I'm not a big fan of is that now the code in the flatMap has to be aware (or assume) that there is an onErrorContinue operator somewhere downstream. In the sample I gave it's pretty obvious but in a real application the code could be split across different classes. In our case the logic of makeWebClientCall is actually happening in a separate "Client" class while the onErrorContinue is in a "Service" class 3 levels of separation up from the "Client". It's almost like every time we use onErrorResume anywhere we'd need to have it followed by an inhibiting onErrorContinue operator, just in case.

Thanks for mentioning onErrorStop, I didn't really pay attention to that one. We'd probably go with that for now, along with a comment; at least it would be better than having to disable retries.

In the end I'm wondering:

  • if the documentation of onErrorContinue should have a stronger warning about this short-circuiting behaviour
  • if there is a need for a similar error-handling operator which wouldn't short-circuit but still provide a way to prevent errors from terminating a stream. Kind of like onErrorResume but without the need to return anything (unless we can achieve the same with that one? I couldn't make it work)

Merci.

if there is a need for a similar error-handling operator which wouldn't short-circuit but still provide a way to prevent errors from terminating a stream. Kind of like onErrorResume but without the need to return anything (unless we can achieve the same with that one? I couldn't make it work)

This is going against the Reactive Streams specification. An error terminates the sequence. onErrorContinue is an operator that breaks this rule, and as such it should be used with extra care and full understanding of the implications.

But since your original source is a Future, I don't think there is a real need for onErrorContinue: there are not a lot of places where unexpected errors can occur.

  • a transient error in the Future itself would be retriable with a simple retry
  • an unexpected error in the flatMap / concatMap that applies process could be dealt with by using the flatMapDelayError/concatMapDelayError variants: these let each source object be mapped to an inner sequence before the main sequence is terminated, even if one of the inners fails (see the sketch after this list).
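
A hedged sketch combining both points, reusing the simplified names from the SQS snippet above (sqsAsyncClient, receiveMessageRequest, process); the buffer sizes come from reactor.util.concurrent.Queues and are only reasonable defaults, not part of the original reply:

Mono.fromFuture(() -> sqsAsyncClient.receiveMessage(receiveMessageRequest))
    .retry(3)                                       // transient errors in the Future: a simple retry
    .map(ReceiveMessageResponse::messages)
    .flatMapMany(Flux::fromIterable)
    .flatMapDelayError(
        message -> process(message),                // per-message pipeline, heavily simplified
        Queues.SMALL_BUFFER_SIZE,                   // concurrency
        Queues.XS_BUFFER_SIZE)                      // prefetch; errors are delayed until all inners are done
    .subscribe();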

Thanks again Simon for your answer. We are going to do away with onErrorContinue now that we have a better understanding. I'm closing this issue.