plotly / dash

Data Apps & Dashboards for Python. No JavaScript Required.

Home Page: https://plotly.com/dash


[Feature Request] Allow background callback tasks to programmatically retry later.

JamesKunstle opened this issue

Is your feature request related to a problem? Please describe.

Background callbacks running in a distributed environment (OpenShift or Kubernetes) can fail for reasons that are recoverable via application logic, e.g. a data resource isn't available at the moment but will be available in the future.

A bad solution is to have the background callback task check for the resource, sleep for some amount of time, check again, and repeat until it succeeds. This consumes a Celery worker thread for no reason and, in our app, leads to worker pool exhaustion.

Describe the solution you'd like
It'd make sense for a background callback task to:

  1. check whether it can execute given the current state,
  2. proceed if it can,
  3. re-enqueue itself if it can't, yielding the worker thread to be used by another task.

Since background callbacks are Celery tasks, the features to enable programmatic retries are already available via the bind argument: a bound task receives a self parameter that can be instructed to retry.

This might look like the following pseudocode:

@dash.callback(
    ...,  # Inputs and Outputs
    background=True,
    celery_bind=True,  # first param to func must be 'self'
    retry_on_exceptions=[DBNotAvailableRetry],
)
def func(self, conn):
    val = conn.get_value()  # may raise DBNotAvailableRetry
    if not val:
        self.retry(after="5s", exponential_backoff=True, jitter=True)
    return val
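
For context, here is roughly what the underlying bind/retry mechanism looks like on a plain Celery task today, outside of Dash. This is a minimal sketch; the broker URL, run_query helper, and DBNotAvailableRetry exception are placeholders, not existing Dash or project code.

from celery import Celery

celery_app = Celery(__name__, broker="redis://localhost:6379/0")  # placeholder broker

class DBNotAvailableRetry(Exception):
    pass

def run_query(query):
    """Placeholder for a data access call that can fail transiently."""
    raise DBNotAvailableRetry("database not reachable yet")

@celery_app.task(bind=True, max_retries=5)
def get_value(self, query):
    try:
        return run_query(query)
    except DBNotAvailableRetry as exc:
        # Re-enqueue this task to run again in 5 seconds, freeing the worker
        # instead of sleeping in it.
        raise self.retry(exc=exc, countdown=5)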

Describe alternatives you've considered
Since Dash controls the context of the executing task when it is enqueued in Celery, pushing the self parameter into the background callback's arguments could be avoided entirely if Dash instead implemented exception handling that triggers a retry when a designated exception is caught.

@celery_app.task(
    bind=True
)
def dash_bg_callback_wrapper(self, user_func, args):
    try:
        results = user_func(*args)
        return results
    except dash.BG_RETRY_EXCEPTION as e:
        # The user could set the delay, knowing their app; it would default
        # to retrying immediately.
        self.retry(countdown=getattr(e, "countdown", 0))

Without being an expert in the inner workings of Dash, here's my alternative suggestion in pseudocode inside the existing implementation, noted with JKUNSTLE NOTE.

def _make_job_fn(fn, celery_app, progress, key):
    cache = celery_app.backend

    # JKUNSTLE NOTE: added bind=True, add `self` param to function
    @celery_app.task(name=f"long_callback_{key}", bind=True)
    def job_fn(self, result_key, progress_key, user_callback_args, context=None):
        def _set_progress(progress_value):
            if not isinstance(progress_value, (list, tuple)):
                progress_value = [progress_value]

            cache.set(progress_key, json.dumps(progress_value, cls=PlotlyJSONEncoder))

        maybe_progress = [_set_progress] if progress else []

        ctx = copy_context()

        def run():
            c = AttributeDict(**context)
            c.ignore_register_page = False
            context_value.set(c)
            try:
                if isinstance(user_callback_args, dict):
                    user_callback_output = fn(*maybe_progress, **user_callback_args)
                elif isinstance(user_callback_args, (list, tuple)):
                    user_callback_output = fn(*maybe_progress, *user_callback_args)
                else:
                    user_callback_output = fn(*maybe_progress, user_callback_args)
            except PreventUpdate:
                # Put NoUpdate dict directly to avoid circular imports.
                cache.set(
                    result_key,
                    json.dumps(
                        {"_dash_no_update": "_dash_no_update"}, cls=PlotlyJSONEncoder
                    ),
                )

            # JKUNSTLE NOTE: catches user-raised retry exception so
            # Dash can control parameterization api.
            except RetryBGTask as e:
                # JKUNSTLE NOTE: retry in 1 second.
                # should optionally be configured by e.args parameters, passed up from user function.
                self.retry(countdown=1)

            except Exception as err:  # pylint: disable=broad-except
                cache.set(
                    result_key,
                    json.dumps(
                        {
                            "long_callback_error": {
                                "msg": str(err),
                                "tb": traceback.format_exc(),
                            }
                        },
                    ),
                )
            else:
                cache.set(
                    result_key, json.dumps(user_callback_output, cls=PlotlyJSONEncoder)
                )

        ctx.run(run)

    return job_fn

@T4rk1n You've helped me with a feature request before, so I'm hoping to bump this for some help!

I think retrying callbacks (normal callbacks too) could be a good feature addition.

For the API, I think the retry options could be callback arguments, and I also like passing args on the error, so something like:

from dash import RetryCallback

class CustomError(Exception):
    pass

@callback(
    ...,
    retry=True,  # Globally retry on all errors.
    retry_after=360,  # Global retry-after argument.
    retry_on_error=[CustomError, RetryCallback],  # Retry only on these errors.
)
def cb(*args):
    raise RetryCallback(after=1)  # Custom retry-after argument.

For the implementation, I think we could catch the exceptions in dispatch and use the renderer to resend the same callback after a timeout. Normal callbacks can be caught directly; for background callbacks, the response dict has a "long_callback_error" key that is currently re-raised as LongCallbackError, so we would need to add more info there to reconstitute the error for retry_on_error.
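
A very rough, framework-agnostic sketch of that shape, just to illustrate the retry signal a renderer could act on; RetryCallback and handle_callback are hypothetical names here, and the real change would live in Dash's dispatch/renderer code:

class RetryCallback(Exception):
    """Hypothetical error a callback raises to request a retry."""

    def __init__(self, after=0):
        super().__init__(after)
        self.after = after

def handle_callback(user_func, args, retry_on_error=(), retry_after=360):
    """Run a callback; on a retryable error, return a payload the renderer
    could use to re-send the same callback after a delay."""
    try:
        return {"response": user_func(*args)}
    except tuple(retry_on_error) as err:
        after = getattr(err, "after", retry_after)
        return {"retry": {"after": after, "error": type(err).__name__}}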

cc @alexcjohnson @Coding-with-Adam

I like it! I might change retry_after and after to retry_delay and delay to be a little more specific ("after" could be an event instead of a time, or could even be taken as "retry after these errors")

Two other pieces come to mind that you might want with this:

  • max_tries - might default to something like 5 so this doesn't kick off an infinite loop if the error isn't really transient, but we could support None or some such if you want to allow an infinite loop.
  • Some way to report on retry status. Maybe retry_status=Output('my_div', 'children'), and by default we construct a string like "attempt #1 failed with CustomError, trying again", but you can provide your own message like raise RetryCallback(status=f"Oops, the database was locked on attempt #{ctx.retry_num}, we'll try again in 2 minutes") or status=dcc.Markdown(...) or whatever.

@alexcjohnson Those params make a lot of sense to me. I'd also like to be able to enable retry_jitter and retry_exponential_backoff to avoid the situation where everything retries at once and floods the backend with requests to work through.
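
For what it's worth, plain Celery tasks already expose backoff and jitter knobs that a Dash-level retry API could presumably map onto. A minimal sketch (the broker URL and task body are placeholders):

from celery import Celery

celery_app = Celery(__name__, broker="redis://localhost:6379/0")  # placeholder broker

class DBNotAvailableRetry(Exception):
    pass

@celery_app.task(
    autoretry_for=(DBNotAvailableRetry,),  # retry automatically on these errors
    retry_kwargs={"max_retries": 5},       # cap the number of attempts
    retry_backoff=True,                    # exponential backoff: 1s, 2s, 4s, ...
    retry_backoff_max=600,                 # never wait longer than 10 minutes
    retry_jitter=True,                     # randomize delays to avoid a retry stampede
)
def fetch_value(key):
    raise DBNotAvailableRetry(f"resource {key!r} not ready yet")  # placeholder body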

@alexcjohnson @T4rk1n What's the typical timeline on a feature request like this for y'all? (Not meant in a demanding tone, just curious so I can set objectives based on your timeline.)

@JamesKunstle it's a great idea, but hasn't made it onto our roadmap yet so I can't say when we'd get to it. Best bet if you want it quick is to contribute a PR, or if your org has resources to commit we could talk about sponsoring development.

Just passing by, but the tenacity package might be of interest here as it provides a flexible decorator interface.

Off the top of my head, it should be stacked like so:

@app.callback(...)
@tenacity.retry(...)
def my_callback(...):
    ...
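
For a slightly fuller sketch of that approach (real tenacity API, but the exception and callback body are placeholders), the retry decorator can combine exponential backoff, jitter, and an attempt cap:

import random

import tenacity

class DBNotAvailableRetry(Exception):
    """Placeholder for a transient error raised by the callback body."""

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(DBNotAvailableRetry),  # only these errors
    wait=tenacity.wait_exponential(multiplier=1, max=60)          # backoff...
    + tenacity.wait_random(0, 2),                                 # ...plus jitter
    stop=tenacity.stop_after_attempt(5),                          # give up eventually
    reraise=True,                                                 # surface the last error
)
def my_callback(value):
    if random.random() < 0.5:
        raise DBNotAvailableRetry("database not reachable yet")
    return value

One caveat worth noting: this retries inside the running callback rather than re-enqueueing it, so the worker (or request) stays occupied while it waits, unlike Celery's self.retry().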

@alexcjohnson great! thank you for the info, I was just curious. I'm working on another PR so I may learn enough through that to take a swing at this. No pressure implied though; "OSS maintainers owe you nothing" and all that.

@jborman-stonex Thanks! I'll check that out