channable / opnieuw

One weird trick to make your code more reliable

Home Page: https://tech.channable.com/posts/2020-02-05-opnieuw.html

retry_window_after_first_call_in_seconds not working as expected

yan-hic opened this issue

The intended behavior, I think, is to time out the attempts for a given API call: for example, try 4 times, but raise the error if the last attempt would fall beyond 60s.

However, this is not how it currently works, at least with retry_async. The value of retry_window_after_first_call_in_seconds currently acts as a limit on the total runtime.

To illustrate:

  • retry_window_after_first_call_in_seconds = 240
  • 10000 async calls, some of which fail with 429 (Too Many Requests); retries logs the attempts for the individual calls but gives up after total runtime + random delay > 240s.

The error "Next attempt would be after retry deadline. No point retrying." is misleading/incorrect as the deadline should apply to the retries for a given call, not to the runtime.

A debug output shows:

retries - 2020-06-15 19:45:30,363 - Sleeping for 8.449 seconds after attempt 1
retries - 2020-06-15 19:45:30,363 - Sleeping for 16.238 seconds after attempt 1
retries - 2020-06-15 19:45:30,370 - Sleeping for 28.286 seconds after attempt 1
... (note: all first-time attempts)
retries - 2020-06-15 19:48:46,969 - Sleeping for 19.310 seconds after attempt 1
retries - 2020-06-15 19:48:47,619 - Sleeping for 26.760 seconds after attempt 1
retries - 2020-06-15 19:48:48,596 - Sleeping for 12.881 seconds after attempt 1
retries - 2020-06-15 19:48:50,346 - Next attempt would be after retry deadline. No point retrying.

I am not sure what the random delay was here, but with the runtime (duration) at about 200s, any delay above 40s would make retries raise the underlying error.

Current (bad) workaround would be to set the value to a very large duration so retries does not time out e.g. 86400 (1 day).

Thanks for taking the time to open an issue.

The retry window starts at the time when the first call to the decorated function is initiated. After a call to the decorated function ends because the function throws, one of the following happens:

  • If the retry window is already over, @retry re-throws.
  • If the retry window is not yet over, @retry picks a delay to wait before making the next call.
  • If the next call would be after the retry window is over, due to that delay, there is no point in waiting but then not retrying, so @retry logs the line you observed and re-throws immediately.
  • If the next call would be within the retry window, wait for the delay, then make the next call.
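
To make the four cases above concrete, here is a minimal sketch of that decision logic. It is not opnieuw's actual implementation: call_with_retry_window, pick_delay, and the injectable clock and sleep parameters are hypothetical names chosen for illustration, and I am assuming that a call landing exactly on the deadline counts as too late.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry_window(
    func: Callable[[], T],
    max_calls_total: int,
    retry_window_seconds: float,
    pick_delay: Callable[[int], float],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper that mirrors the cases listed above."""
    if max_calls_total < 1:
        raise ValueError("max_calls_total must be at least 1")
    # The window opens when the first call is initiated, so time spent
    # inside func counts against it as well.
    deadline = clock() + retry_window_seconds
    for attempt in range(1, max_calls_total + 1):
        try:
            return func()
        except Exception:
            if attempt == max_calls_total:
                raise  # no calls left
            now = clock()
            if now >= deadline:
                raise  # the retry window is already over
            delay = pick_delay(attempt)
            if now + delay > deadline:
                # Waiting and then not retrying would be pointless.
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

With a 60s window, a call that runs for 50s before failing, and a 20s delay, this sketch re-throws right after the first call, because the next attempt would start at 70s.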

> logs the attempts for the individual calls but gives up after total runtime + random delay > 240s

Yes, the retry window is the window of time (wall clock time) that starts when the first call is initiated. As the readme puts it:

> retry_window_after_first_call_in_seconds is the maximum number of seconds after the first call was initiated, where we would still do a new attempt.

It does not refer to the time spent waiting.

> as the deadline should apply to the retries for a given call, not to the runtime.

Time spent in the decorated function also counts. This is by design. If the decorated function supports a timeout (for example, because it makes an HTTP call), then the retry window should be larger than the timeout; otherwise you can end up in a situation where the retry window is already over after the first call.

A good retry window to timeout ratio depends a bit on your situation, but in our codebase we usually start with a retry window 3× or 4× the timeout, to ensure that there is room for 2 or 3 retries after the initial call. We also prefer an aggressive timeout with more attempts, over longer timeouts with fewer attempts, because request durations tend to be fast at the 50th percentile, but they can be slow in the 95th percentile. Rather than waiting even longer in an already unlucky case, and wasting the retry window, we prefer to retry early and hope to be less unlucky in that attempt.
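
As a rough check on the 3× or 4× rule of thumb, the arithmetic can be sketched as follows. max_calls_in_window is a hypothetical helper, not part of opnieuw. It assumes the worst case where every call runs for the full timeout and then fails, it ignores the delay between attempts (so it is an upper bound), and it assumes a call must start strictly before the window ends.

```python
import math


def max_calls_in_window(timeout_seconds: float, retry_window_seconds: float) -> int:
    """Upper bound on how many calls can start inside the retry window.

    Assumes each call takes the full timeout and that there is no wait
    between calls, so call k starts at (k - 1) * timeout_seconds.
    """
    if timeout_seconds <= 0 or retry_window_seconds <= 0:
        raise ValueError("timeout and retry window must be positive")
    return math.ceil(retry_window_seconds / timeout_seconds)


# Retries are the calls after the initial one.
retries_at_3x = max_calls_in_window(10.0, 30.0) - 1
retries_at_4x = max_calls_in_window(10.0, 40.0) - 1
```

Under these assumptions a window of 3× the timeout leaves room for 2 retries and 4× leaves room for 3, while a window equal to the timeout leaves room for none.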

> set the value to a very large duration so retries does not time out e.g. 86400 (1 day).

Note that the retry window also affects the delay between retries. The delay is computed such that if every call failed instantly, the time spent waiting in between the max_calls calls is at most equal to the retry window. Due to jitter, the expected time waiting is half of the retry window. So if you set the retry window to a very high value, the wait between attempts will also become longer.
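
One schedule that has the properties described above can be sketched like this. The doubling between waits and the uniform jitter are my assumptions for illustration, and max_delays and jittered_delays are hypothetical names; the exact formula opnieuw uses may differ.

```python
import random
from typing import List, Optional


def max_delays(max_calls_total: int, retry_window_seconds: float) -> List[float]:
    """Upper bounds for the waits between calls.

    The bounds double each time (an assumption) and are scaled so that
    they sum to the retry window.
    """
    if max_calls_total < 1:
        raise ValueError("max_calls_total must be at least 1")
    n_waits = max_calls_total - 1
    if n_waits == 0:
        return []
    # base * (1 + 2 + ... + 2**(n_waits - 1)) == retry_window_seconds
    base = retry_window_seconds / (2 ** n_waits - 1)
    return [base * 2 ** i for i in range(n_waits)]


def jittered_delays(
    max_calls_total: int,
    retry_window_seconds: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw each wait uniformly between 0 and its upper bound.

    The expected total wait is then half of the retry window.
    """
    rng = rng or random.Random()
    return [
        rng.uniform(0.0, upper)
        for upper in max_delays(max_calls_total, retry_window_seconds)
    ]
```

Under these assumptions, 4 calls with a 240s window give upper bounds of roughly 34s, 69s, and 137s. With a window of 86400s the first wait alone could be up to about 3.4 hours, which is why a very large window is a poor workaround.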

Does this make sense?

Closing since retry_window_after_first_call_in_seconds is working as intended.