plotly / dash

Data Apps & Dashboards for Python. No JavaScript Required.

Home Page: https://plotly.com/dash


[Feature Request] Allow background callback tasks to programmatically retry later.

JamesKunstle opened this issue

Is your feature request related to a problem? Please describe.

Background callbacks running in a distributed environment (OpenShift or Kubernetes) can fail for reasons that are recoverable via application logic, e.g. a data resource isn't available at the moment but will be available in the future.

A bad solution is to have the background callback task check for the resource, sleep for some amount of time, check again, and repeat until it succeeds. This consumes a Celery worker thread for no reason and, in our app, leads to worker pool exhaustion.

Describe the solution you'd like
It'd make sense for a background callback task to:

  1. check whether it can execute given the current state,
  2. proceed if it can,
  3. re-enqueue itself if it can't, yielding the worker thread to be used by another task.

Since background callbacks are Celery tasks, the features to enable programmatic retries are already available via the bind argument: a bound task receives a self parameter that can be instructed to retry.

This might look like the following pseudocode:

@dash.callback(
    ...,  # Inputs and Outputs
    background=True,
    celery_bind=True,  # first param to func must be 'self'
    retry_on_exceptions=[DBNotAvailableRetry],
)
def func(self, conn):
    val = conn.get_value()  # may raise DBNotAvailableRetry
    if not val:
        self.retry(after="5s", exponential_backoff=True, jitter=True)
    return val
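
For context, here is roughly what the underlying bind/retry mechanism looks like on a plain Celery task today, outside of Dash. This is a minimal sketch; the broker URL, run_query helper, and DBNotAvailableRetry exception are placeholders, not existing Dash or project code.

from celery import Celery

celery_app = Celery(__name__, broker="redis://localhost:6379/0")  # placeholder broker

class DBNotAvailableRetry(Exception):
    pass

def run_query(query):
    """Placeholder for a data access call that can fail transiently."""
    raise DBNotAvailableRetry("database not reachable yet")

@celery_app.task(bind=True, max_retries=5)
def get_value(self, query):
    try:
        return run_query(query)
    except DBNotAvailableRetry as exc:
        # Re-enqueue this task to run again in 5 seconds, freeing the worker
        # instead of sleeping in it.
        raise self.retry(exc=exc, countdown=5)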

Describe alternatives you've considered
Since Dash controls the context of the executing task when it is enqueued in Celery, pushing the self parameter into the background callback's arguments could be avoided entirely if Dash instead implemented exception handling that triggers a retry when a designated exception is caught.

@celery_app.task(
    bind=True
)
def dash_bg_callback_wrapper(self, user_func, args):
    try:
        results = user_func(*args)
        return results
    except dash.BG_RETRY_EXCEPTION as e:
        # The user could set the delay, knowing their app; it would default
        # to retrying immediately.
        self.retry(countdown=getattr(e, "countdown", 0))

Without being an expert in the inner workings of Dash, here's my alternative suggestion in pseudocode inside the existing implementation, noted with JKUNSTLE NOTE.

def _make_job_fn(fn, celery_app, progress, key):
    cache = celery_app.backend

    # JKUNSTLE NOTE: added bind=True, add `self` param to function
    @celery_app.task(name=f"long_callback_{key}", bind=True)
    def job_fn(self, result_key, progress_key, user_callback_args, context=None):
        def _set_progress(progress_value):
            if not isinstance(progress_value, (list, tuple)):
                progress_value = [progress_value]

            cache.set(progress_key, json.dumps(progress_value, cls=PlotlyJSONEncoder))

        maybe_progress = [_set_progress] if progress else []

        ctx = copy_context()

        def run():
            c = AttributeDict(**context)
            c.ignore_register_page = False
            context_value.set(c)
            try:
                if isinstance(user_callback_args, dict):
                    user_callback_output = fn(*maybe_progress, **user_callback_args)
                elif isinstance(user_callback_args, (list, tuple)):
                    user_callback_output = fn(*maybe_progress, *user_callback_args)
                else:
                    user_callback_output = fn(*maybe_progress, user_callback_args)
            except PreventUpdate:
                # Put NoUpdate dict directly to avoid circular imports.
                cache.set(
                    result_key,
                    json.dumps(
                        {"_dash_no_update": "_dash_no_update"}, cls=PlotlyJSONEncoder
                    ),
                )

            # JKUNSTLE NOTE: catches user-raised retry exception so
            # Dash can control parameterization api.
            except RetryBGTask as e:
                # JKUNSTLE NOTE: retry in 1 second.
                # should optionally be configured by e.args parameters, passed up from user function.
                self.retry(countdown=1)

            except Exception as err:  # pylint: disable=broad-except
                cache.set(
                    result_key,
                    json.dumps(
                        {
                            "long_callback_error": {
                                "msg": str(err),
                                "tb": traceback.format_exc(),
                            }
                        },
                    ),
                )
            else:
                cache.set(
                    result_key, json.dumps(user_callback_output, cls=PlotlyJSONEncoder)
                )

        ctx.run(run)

    return job_fn

@T4rk1n You've helped me with a feature request before, so I'm hoping to bump this for some help!

I think retrying callbacks (normal callbacks too) could be a good feature addition.

For the API, I think the retry options could be callback arguments, and I also like passing args on the error, so something like:

from dash import RetryCallback

class CustomError(Exception):
    pass

@callback(
    ...,
    retry=True,  # Globally retry on all errors.
    retry_after=360,  # Global retry-after argument.
    retry_on_error=[CustomError, RetryCallback],  # Retry only on these errors.
)
def cb(*args):
    raise RetryCallback(after=1)  # Custom retry-after argument.

For the implementation, I think we could catch the exceptions in dispatch and use the renderer to resend the same callback after a timeout. Normal callbacks can be caught directly; for background callbacks, the response dict has a "long_callback_error" key that is currently re-raised as LongCallbackError, so we would need to add more info there to reconstitute the error for retry_on_error.
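
A very rough, framework-agnostic sketch of that shape, just to illustrate the retry signal a renderer could act on; RetryCallback and handle_callback are hypothetical names here, and the real change would live in Dash's dispatch/renderer code:

class RetryCallback(Exception):
    """Hypothetical error a callback raises to request a retry."""

    def __init__(self, after=0):
        super().__init__(after)
        self.after = after

def handle_callback(user_func, args, retry_on_error=(), retry_after=360):
    """Run a callback; on a retryable error, return a payload the renderer
    could use to re-send the same callback after a delay."""
    try:
        return {"response": user_func(*args)}
    except tuple(retry_on_error) as err:
        after = getattr(err, "after", retry_after)
        return {"retry": {"after": after, "error": type(err).__name__}}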

cc @alexcjohnson @Coding-with-Adam

I like it! I might change retry_after and after to retry_delay and delay to be a little more specific ("after" could be an event instead of a time, or could even be taken as "retry after these errors")

Two other pieces come to mind that you might want with this:

  • max_tries - might default to something like 5 so this doesn't kick off an infinite loop if the error isn't really transient, but we could support None or some such if you want to allow an infinite loop.
  • Some way to report on retry status. Maybe retry_status=Output('my_div', 'children'), and by default we construct a string like "attempt #1 failed with CustomError, trying again", but you can provide your own message like raise RetryCallback(status=f"Oops, the database was locked on attempt #{ctx.retry_num}, we'll try again in 2 minutes") or status=dcc.Markdown(...) or whatever.

@alexcjohnson Those params make a lot of sense to me. I'd also like to be able to enable retry_jitter and retry_exponential_backoff to avoid the situation where everything retries at once and floods the backend with requests to work through.
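
For what it's worth, plain Celery tasks already expose backoff and jitter knobs that a Dash-level retry API could presumably map onto. A minimal sketch (the broker URL and task body are placeholders):

from celery import Celery

celery_app = Celery(__name__, broker="redis://localhost:6379/0")  # placeholder broker

class DBNotAvailableRetry(Exception):
    pass

@celery_app.task(
    autoretry_for=(DBNotAvailableRetry,),  # retry automatically on these errors
    retry_kwargs={"max_retries": 5},       # cap the number of attempts
    retry_backoff=True,                    # exponential backoff: 1s, 2s, 4s, ...
    retry_backoff_max=600,                 # never wait longer than 10 minutes
    retry_jitter=True,                     # randomize delays to avoid a retry stampede
)
def fetch_value(key):
    raise DBNotAvailableRetry(f"resource {key!r} not ready yet")  # placeholder body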

@alexcjohnson @T4rk1n What's the typical timeline on a feature request like this for y'all? (Not meant in a demanding tone, just curious so I can set objectives based on your timeline.)

@JamesKunstle it's a great idea, but hasn't made it onto our roadmap yet so I can't say when we'd get to it. Best bet if you want it quick is to contribute a PR, or if your org has resources to commit we could talk about sponsoring development.

Just passing by, but the tenacity package might be of interest here as it provides a flexible decorator interface.

Off the top of my head, it should be stacked like so:

@app.callback(...)
@tenacity.retry(...)
def my_callback(...):
    ...
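
For a slightly fuller sketch of that approach (real tenacity API, but the exception and callback body are placeholders), the retry decorator can combine exponential backoff, jitter, and an attempt cap:

import random

import tenacity

class DBNotAvailableRetry(Exception):
    """Placeholder for a transient error raised by the callback body."""

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(DBNotAvailableRetry),  # only these errors
    wait=tenacity.wait_exponential(multiplier=1, max=60)          # backoff...
    + tenacity.wait_random(0, 2),                                 # ...plus jitter
    stop=tenacity.stop_after_attempt(5),                          # give up eventually
    reraise=True,                                                 # surface the last error
)
def my_callback(value):
    if random.random() < 0.5:
        raise DBNotAvailableRetry("database not reachable yet")
    return value

One caveat worth noting: this retries inside the running callback rather than re-enqueueing it, so the worker (or request) stays occupied while it waits, unlike Celery's self.retry().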

@alexcjohnson great! thank you for the info, I was just curious. I'm working on another PR so I may learn enough through that to take a swing at this. No pressure implied though; "OSS maintainers owe you nothing" and all that.

@jborman-stonex Thanks! I'll check that out