[Feature Request] Allow background callback tasks to programmatically retry later.
JamesKunstle opened this issue
**Is your feature request related to a problem? Please describe.**

Background callbacks running in a distributed environment (OpenShift or Kubernetes) can fail for reasons that are recoverable via application logic, e.g. a data resource isn't available at one point in time but will be available in the future.

A bad solution is to have the background callback task check for the resource, sleep for some amount of time, check again, and repeat. This occupies the Celery worker thread for no reason and, in our app, leads to worker-pool exhaustion.
**Describe the solution you'd like**

It'd make sense for a background callback task to:

- check whether it can execute given the current state,
- proceed if it can,
- re-enqueue itself if it can't, yielding the worker thread to another task.

Since background callbacks are Celery tasks, the features needed to enable programmatic retries are already available via the `bind` argument: a bound task receives a `self` parameter that can be instructed to retry.
This might look like the following pseudocode:

```python
@dash.callback(
    ...,  # Inputs and Outputs
    background=True,
    celery_bind=True,  # first param to func must be 'self'
    retry_on_exceptions=[DBNotAvailableRetry],
)
def func(self, conn):
    val = conn.get_value()  # raises DBNotAvailableRetry exception
    if not val:
        self.retry(after="5s", exponential_backoff=True, jitter=True)
    return val
```
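To make the "re-enqueue instead of sleep" idea concrete without a Celery broker, here is a minimal self-contained sketch of the control flow. The `Retry` exception and scheduler loop are illustrative only, not Dash or Celery API; in Celery the same effect comes from a bound task calling `self.retry(countdown=...)`, which frees the worker between attempts.

```python
import heapq
import time

class Retry(Exception):
    """Raised by a task to ask the scheduler to re-enqueue it later."""
    def __init__(self, countdown=0):
        super().__init__(countdown)
        self.countdown = countdown

def run_until_done(task, now=time.monotonic):
    """Run `task`; on Retry, re-enqueue it with a delay instead of blocking.

    A real worker pool would run other tasks between attempts; a single
    loop plays both roles here to keep the sketch self-contained.
    """
    pending = [(now(), 0, task)]  # (eta, seq, callable) min-heap
    seq = 1
    while pending:
        eta, _, fn = heapq.heappop(pending)
        wait = eta - now()
        if wait > 0:
            time.sleep(wait)  # the pool would execute other tasks here
        try:
            return fn()
        except Retry as r:
            heapq.heappush(pending, (now() + r.countdown, seq, fn))
            seq += 1

# Toy task: the "resource" becomes available on the third check.
attempts = {"n": 0}
def check_resource():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise Retry(countdown=0.01)  # not ready yet, try again shortly
    return "value"

result = run_until_done(check_resource)
print(result)  # prints: value
```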
**Describe alternatives you've considered**

Since Dash controls the context of the executing task when it's enqueued in Celery, pushing the `self` parameter into the background callback arguments could be avoided if Dash instead implemented exception handling that triggers a retry when a designated exception is caught:
```python
@celery_app.task(bind=True)
def dash_bg_callback_wrapper(self, user_func, args):
    try:
        results = user_func(*args)
        return results
    except dash.BG_RETRY_EXCEPTION as e:
        # The user could set `after`, knowing their app; it would
        # default to no delay before the retry.
        self.retry(after=e.args["after"])
```
Without being an expert in the inner workings of Dash, here's my alternate suggestion in pseudocode inside the existing implementation, marked with `JKUNSTLE NOTE`:
```python
def _make_job_fn(fn, celery_app, progress, key):
    cache = celery_app.backend

    # JKUNSTLE NOTE: added bind=True, which adds a `self` param to the function.
    @celery_app.task(name=f"long_callback_{key}", bind=True)
    def job_fn(self, result_key, progress_key, user_callback_args, context=None):
        def _set_progress(progress_value):
            if not isinstance(progress_value, (list, tuple)):
                progress_value = [progress_value]
            cache.set(progress_key, json.dumps(progress_value, cls=PlotlyJSONEncoder))

        maybe_progress = [_set_progress] if progress else []

        ctx = copy_context()

        def run():
            c = AttributeDict(**context)
            c.ignore_register_page = False
            context_value.set(c)
            try:
                if isinstance(user_callback_args, dict):
                    user_callback_output = fn(*maybe_progress, **user_callback_args)
                elif isinstance(user_callback_args, (list, tuple)):
                    user_callback_output = fn(*maybe_progress, *user_callback_args)
                else:
                    user_callback_output = fn(*maybe_progress, user_callback_args)
            except PreventUpdate:
                # Put NoUpdate dict directly to avoid circular imports.
                cache.set(
                    result_key,
                    json.dumps(
                        {"_dash_no_update": "_dash_no_update"}, cls=PlotlyJSONEncoder
                    ),
                )
            # JKUNSTLE NOTE: catch the user-raised retry exception so
            # Dash can control the parameterization API.
            except RetryBGTask as e:
                # JKUNSTLE NOTE: retry in 1 second. This should optionally be
                # configurable via e.args, passed up from the user function.
                self.retry(countdown=1)
            except Exception as err:  # pylint: disable=broad-except
                cache.set(
                    result_key,
                    json.dumps(
                        {
                            "long_callback_error": {
                                "msg": str(err),
                                "tb": traceback.format_exc(),
                            }
                        },
                    ),
                )
            else:
                cache.set(
                    result_key, json.dumps(user_callback_output, cls=PlotlyJSONEncoder)
                )

        ctx.run(run)

    return job_fn
```
@T4rk1n You've helped me with a feature request before; hoping to bump this for some help!
I think retrying callbacks (normal callbacks too) could be a good feature addition.
For the API, I think the retry options could live in the callback arguments, and I also like having args in the error, so something like:
```python
from dash import RetryCallback

class CustomError(Exception):
    pass

@callback(
    ...,
    retry=True,  # Global: retry on all errors.
    retry_after=360,  # Global retry-after argument.
    retry_on_error=[CustomError, RetryCallback],  # Retry only on these errors.
)
def cb(*args):
    raise RetryCallback(after=1)  # Custom retry-after argument.
```
For the implementation, I think we could catch the exceptions in dispatch and use the renderer to resend the same callback after a timeout. Normal callbacks can be caught directly; a background callback response has a `"long_callback_error"` key in the response dict that is currently re-raised as `LongCallbackError`, so we would need to add more info there to reconstitute the error for `retry_on_error`.
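The dispatch-side decision described above can be sketched as a small helper: given a raised error and the callback's retry config, return the delay before resending or `None` for no retry. All names here mirror the proposed API and are hypothetical, not existing Dash code.

```python
class RetryCallback(Exception):
    """Hypothetical exception a callback raises to request a retry."""
    def __init__(self, after=None):
        super().__init__(after)
        self.after = after

def retry_delay_for(err, *, retry=False, retry_after=360, retry_on_error=()):
    """Return the delay in seconds before resending, or None for no retry."""
    if retry_on_error and not isinstance(err, tuple(retry_on_error)):
        return None  # error class not in the allow-list
    if not retry and not retry_on_error:
        return None  # retries not enabled at all
    # An explicit after= on the exception overrides the global retry_after.
    if isinstance(err, RetryCallback) and err.after is not None:
        return err.after
    return retry_after

class CustomError(Exception):
    pass

print(retry_delay_for(CustomError(), retry=True))                    # 360
print(retry_delay_for(RetryCallback(after=1),
                      retry_on_error=[CustomError, RetryCallback]))  # 1
print(retry_delay_for(ValueError(),
                      retry_on_error=[CustomError, RetryCallback]))  # None
```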
I like it! I might change `retry_after` and `after` to `retry_delay` and `delay` to be a little more specific ("after" could be an event instead of a time, or could even be taken as "retry after these errors").

Two other pieces come to mind you might want with this:

- `max_tries` - might default to something like 5 so this doesn't kick off an infinite loop if the error isn't really transient, but we could support `None` or some such if you want to allow an infinite loop.
- Some way to report on retry status. Maybe `retry_status=Output('my_div', 'children')`, and by default we construct a string like `attempt #1 failed with CustomError, trying again`, but you can provide your own message like `raise RetryCallback(status=f"Oops, the database was locked on attempt #{ctx.retry_num}, we'll try again in 2 minutes")` or `status=dcc.Markdown(...)` or whatever.
@alexcjohnson Those params make a lot of sense to me. I'd also like to be able to enable `retry_jitter` and `retry_exponential_backoff`, to avoid the situation where everything retries all at once and there's a flood of requests to work through.
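Celery exposes these as the `retry_backoff` and `retry_jitter` task options; the delay they describe is commonly computed along these lines (a sketch with illustrative names, not Celery's exact implementation):

```python
import random

def backoff_delay(base, attempt, *, exponential=True, jitter=True, cap=600):
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped at
    `cap`, with full jitter so simultaneous failures don't all retry at once."""
    delay = base * (2 ** attempt) if exponential else base
    delay = min(delay, cap)
    return random.uniform(0, delay) if jitter else delay

# Without jitter the schedule is deterministic: 5, 10, 20, 40, ...
print([backoff_delay(5, a, jitter=False) for a in range(4)])  # [5, 10, 20, 40]
```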
@alexcjohnson @T4rk1n What's the typical timeline on a feature request like this for y'all? (Not meant in a demanding tone, just curious so I can set objectives based on your timeline.)
@JamesKunstle it's a great idea, but hasn't made it onto our roadmap yet so I can't say when we'd get to it. Best bet if you want it quick is to contribute a PR, or if your org has resources to commit we could talk about sponsoring development.
Just passing by, but the `tenacity` package might be of interest here, as it provides a flexible retry decorator interface. Off the top of my head, it should be stacked like so:

```python
@app.callback(...)
@tenacity.retry(...)
def my_callback(...):
    ...
```
@alexcjohnson great! thank you for the info, I was just curious. I'm working on another PR so I may learn enough through that to take a swing at this. No pressure implied though; "OSS maintainers owe you nothing" and all that.
@jborman-stonex Thanks! I'll check that out