Exponential Backoff Retry Custom Policy

This is a implementation of the Exponential BackOff Retry algorithm that increases the backoff period for each retry attempt in case of app error. Applying this to your API or proxy you would be able to:

Define an initial interval whose first attempt should wait
Define a multiplier, used to increment the intervals between retries
Define a maximum interval, after which no increment on the interval will occur
Define a maximum elapsed time as a breaking point to stop retrying
Define expressions to validate the application's response. Expected responses can be configured in case no error occurs, or specific error expressions.
Perform a dynamic retry strategy, abstracting from the complexity of retrying at application level

Why?

The idea behind the Exponential Backoff Retry algorithm is to provide a greater degree of resilience in our services when performing a backoff that is not static in the event of an error event. In this way, under a persistent error scenario, we do not overload the impacted system/service, but we provide a graceful degradation mechanism increasing the waiting time between retries after each retry failure.

How?

The logic uses the values injected during the configuration of the policy to calculate the following variables:

loopSize: this variable indicates the number of iterations to perform (used by the for-each). It is calculated at the beginning, and in the following way: ceil (ROUNDUP) of default_max_elapsed_time / 1000 / default_multiplier
waitTime: this value indicates the time to wait. It is updated as follows:
- At the beginning it is empty, because you have to try at least once to find out if there is an error
- If there is an error the first time, it takes the value of default_initial_interval
- If there are successive errors, it takes the value of waitTime * default_multiplier

Other variables are calculated during the process but are not specific to the algorithm, but are for internal use for flow control. This policy does not have any dependency outside http-policy and mule-core components.

Usage

After publishing to Exchange, follow these steps to apply the policy to an existing managed API (or proxy):

Log into Anypoint Platform
Enter API Manager
Click on the API version for the application you want to apply the policy to
Click on Policies
Click on Apply New Policy
Filter by 'Custom' category and select 'Exponential Backoff Retry'. Click on 'Configure Policy' button
Give value to the policy's parameters:

Parameter	Purpose
default_initial_interval	Indicates the initial wait time (when the first error happens)
default_max_elapsed_time	Indicates the maximum wait time. After this time, no retry will be performed. Similar to a timeout.
default_max_interval	Indicates the upper bound of the wait interval. Once this value is reached, the next retry will not be multiplied by the default_multiplier
default_multiplier	Indicates the increment value between repetitions. Between attempts, the calculation of the current wait time is given by previous interval value * multiplier

Development

The following commands are required during development phase

Task	Command
Package policy	mvn clean install
Publish to Exchange - Make sure to update the pom.xml file with your org ID -	mvn deploy

Limitations

This policy makes use of the for-each module. Since there is no break condition to exit the loop, this policy iterates through all the objects in the array (please see "How?" Section). This means that it is convenient to apply this policy with small values of intervals and elapsed time, to keep the loop fast to iterate (please see Benchmark section below). Of course, the current logic could be replaced with custom code (java module, script execution). It is up to the user to implement it. Just keep in mind that one of the main motivations for this policy is not to have too many dependencies.

Benchmark

The policy has been tested using a Mule application that returns error or ok at random, based on a number generated with the random() Dataweave function (https://docs.mulesoft.com/mule-runtime/4.3/dw-core-functions-random). If it is odd, it returns an error (MULE:EXPRESSION, "org.mule.weave.v2.runtime.core.exception.DivisionByZeroException: Division by zero"), otherwise the result is satisfactory. The purpose of this test was to emulate a transient error, i.e. network failure or server overload, to check effectiveness around error rate and overall performance. Please see the following results.

Application config

Deployment Model	Runtime Version	Worker Size	Workers
CloudHub (US East - Ohio - )	4.3.0	0.1	1

Test #0 - NO policy applied -

Test suite configuration

# of request	# of threads	# of total concurrent requests
100	2	200

Results

Avg	Median	Min	Max	Error Rate	Throughput
398 ms	378 ms	349 ms	1410 ms	%46.50	4.9/sec

Test #1

Policy configuration

default_initial_interval	default_max_elapsed_time	default_max_interval	default_multiplier
2000 ms	10000 ms	10000 ms	1.5

Test suite configuration

# of request	# of threads	# of total concurrent requests
100	2	200

Results

Loop Size	Avg	Median	Min	Max	Error Rate	Throughput
7	7813 ms	3391 ms	369 ms	41822 ms	%00.00	15.1/min

Test #2

Policy configuration

default_initial_interval	default_max_elapsed_time	default_max_interval	default_multiplier
2000 ms	10000 ms	5000 ms	1.5

Test suite configuration

# of request	# of threads	# of total concurrent requests
100	2	200

Results

Loop Size	Avg	Median	Min	Max	Error Rate	Throughput
7	4584 ms	2416 ms	369 ms	39133 ms	%00.00	25.4/min

Test #3

Policy configuration

default_initial_interval	default_max_elapsed_time	default_max_interval	default_multiplier
2000 ms	10000 ms	3500 ms	1.5

Test suite configuration

# of request	# of threads	# of total concurrent requests
100	2	200

Results

Loop Size	Avg	Median	Min	Max	Error Rate	Throughput
7	2019 ms	1120 ms	367 ms	20428 ms	%00.00	55.4/min

Test #4

Policy configuration

default_initial_interval	default_max_elapsed_time	default_max_interval	default_multiplier
2000 ms	600000 ms	3500 ms	1.5

Test suite configuration

# of request	# of threads	# of total concurrent requests
100	2	200

Results

Loop Size	Avg	Median	Min	Max	Error Rate	Throughput
400	3579 ms	1572 ms	365 ms	37890 ms	%00.00	32.6/min

Test #5

Policy configuration

default_initial_interval	default_max_elapsed_time	default_max_interval	default_multiplier
2000 ms	600000 ms	120000 ms	1.5

Test suite configuration

# of request	# of threads	# of total concurrent requests
500	1	500

Results

Loop Size	Avg	Median	Min	Max	Error Rate	Throughput	95% percentile
400	2772 ms	1385 ms	361 ms	299453 ms	%00.20	21.6/min	5615

Test #6

Policy configuration

default_initial_interval	default_max_elapsed_time	default_max_interval	default_multiplier
2000 ms	600000 ms	60000 ms	1.5

Test suite configuration

# of request	# of threads	# of total concurrent requests
500	1	500

Results

Loop Size	Avg	Median	Min	Max	Error Rate	Throughput	95% percentile
400	1953 ms	479 ms	365 ms	49785 ms	%00.00	30.7/min	5220

Contribution

Want to contribute? Great!

For public contributions - Just fork the repo, make your updates and open a pull request!
For internal contributions - Use a simplified feature workflow following these steps:
- Clone the repo
- Create a feature branch using the naming convention feature/name-of-the-feature
- Once it's ready, push your changes
- Open a pull request for a review

mulesoft-catalyst / exponential-backoff-retry-policy-mule-4

Exponential Backoff Retry Custom Policy

Why?

How?

Usage

Development

Limitations

Benchmark

Application config

Test #0 - NO policy applied -

Test suite configuration

Results

Test #1

Policy configuration

Test suite configuration

Results

Test #2

Policy configuration

Test suite configuration

Results

Test #3

Policy configuration

Test suite configuration

Results

Test #4

Policy configuration

Test suite configuration

Results

Test #5

Policy configuration

Test suite configuration

Results

Test #6

Policy configuration

Test suite configuration

Results

Contribution

About