microsoft / DurableFunctionsMonitor

A monitoring/debugging UI tool for Azure Durable Functions

[Feature Request] Restarting multiple instances

mike268 opened this issue

If we have a lot of failed instances, e.g. due to backend system failures, we currently have to restart the instances individually.
In our environment, restarting several instances for a certain filter setting would make our work much easier.

A button to reprocess all currently selected instances would be nice.

Version 6.4.
Backend MSSQL

This is a necessary feature. I work on a project that has millions of instances. Sometimes, thousands of instances fail, and the failure can have any cause (for example, transient error retries exhausted). Having to click on each of thousands of failed instances to restart them is not practical. I've had teams explicitly decline to adopt Durable Functions specifically because of this gap (which is unfortunate since the SDK is very powerful).

For that reason, I had to create a separate reusable package that exposes a bulk restart HTTP trigger along with an orchestration that handles restarting instances. You can restart everything, or you can pass a specific error (for example, any transient error that should be resolved upon a retry).

So basically, I'm using Durable Functions to asynchronously restart Durable Functions. Having that capability built into Durable Functions Monitor, along with a new form that allows for a custom error message to be passed, would be extremely helpful (as suggested by mike268). For small projects, this may not be needed, but when you are working with millions of records and potentially thousands of failed instances, automating bulk restarts is crucial.

@russor45, I agree, it would be a very useful feature.

> along with a new form that allows for a custom error message to be passed

Can you please clarify this? What kind of custom error message would you like to have, and where should it be passed?

@scale-tone,

The custom error message could be a text box that would allow filtering and restarting instances based on the search text provided. It would match any failed instance whose output contains the search text. Additionally, it could be helpful to have options to filter failed instances based on other criteria, such as the start time or last updated time, to further refine the scope of the bulk restart operation. The default filter could be to restart all failed instances (for those who don't need to filter and just want all failed instances restarted).

For example, if you want to restart all instances that failed due to a specific transient error (e.g., 'Connection Timeout' or error code 'ABC123'), you could enter that error message or code in the custom error message field, and the bulk restart operation would target only those failed instances whose output text matches the specified criteria. This would save a significant amount of time and effort compared to manually identifying and restarting each failed instance individually, especially when dealing with thousands of failed instances.
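The filtering described above could be sketched roughly as follows. This is only an illustration: the `InstanceStatus` record, its field names, and the `select_for_restart` helper are hypothetical and do not reflect the Durable Functions Monitor data model; the match is assumed to be a case-insensitive substring search over the instance output.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class InstanceStatus:
    # Hypothetical record; field names are illustrative, not the DfMon schema.
    instance_id: str
    runtime_status: str
    output: str
    created_time: datetime


def select_for_restart(
    instances: Iterable[InstanceStatus],
    error_text: Optional[str] = None,
    created_after: Optional[datetime] = None,
    created_before: Optional[datetime] = None,
) -> List[str]:
    """Return IDs of failed instances that match the optional criteria.

    With no criteria given, every failed instance is selected (the default).
    """
    needle = error_text.lower() if error_text else None
    selected = []
    for inst in instances:
        if inst.runtime_status != "Failed":
            continue  # only failed instances are candidates
        if needle is not None and needle not in inst.output.lower():
            continue  # output does not contain the search text
        if created_after is not None and inst.created_time < created_after:
            continue
        if created_before is not None and inst.created_time > created_before:
            continue
        selected.append(inst.instance_id)
    return selected
```

Calling `select_for_restart(instances, error_text="Connection Timeout")` would return only the failed instances whose output mentions that error, while `select_for_restart(instances)` returns all failed ones.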

While a synchronous bulk restart approach may not perform well, one solution could be to have a built-in orchestrator that fans out the bulk restart operation asynchronously. Currently, I've implemented a custom bulk restart orchestrator, but having this functionality natively integrated into the Durable Functions Monitor tool, along with a UI for configuring the error criteria, would be a valuable addition. That way, others don't have to reinvent the wheel and write their own bulk restart fan-out feature as I did. I'm not saying this is the only solution. I'm just saying that any solution is fine as long as it performs well and can handle large volumes of restarts in a timely manner; that is the problem statement.
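To illustrate the fan-out idea, here is a minimal sketch that uses plain `asyncio` with a concurrency limit rather than an actual Durable Functions orchestrator. The `restart_all` helper, the `restart_one` callback, and the `max_parallel` parameter are all assumptions made for this example; a real implementation would call whatever restart API the backend exposes.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def restart_all(
    instance_ids: Iterable[str],
    restart_one: Callable[[str], Awaitable[None]],  # hypothetical restart call
    max_parallel: int = 20,
) -> Dict[str, str]:
    """Restart instances concurrently; return {instance_id: "ok" or error text}."""
    gate = asyncio.Semaphore(max_parallel)  # caps the number of in-flight restarts
    results: Dict[str, str] = {}

    async def worker(instance_id: str) -> None:
        async with gate:
            try:
                await restart_one(instance_id)
                results[instance_id] = "ok"
            except Exception as exc:
                # One failed restart must not abort the rest of the batch.
                results[instance_id] = f"error: {exc}"

    await asyncio.gather(*(worker(i) for i in instance_ids))
    return results
```

Capping concurrency keeps thousands of restarts from overwhelming the backend at once, and recording a per-instance outcome lets the caller report which restarts did not go through.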

@mike268 What are your thoughts?

@russor45 that would be perfect for my use case

The suggested way of implementing it would be:

  1. Reuse the main screen to build the filtered list of instances that the batch operation will be applied to. The filtering options there should be more than enough for that.
  2. (Since the list on the main screen uses infinite scrolling) add buttons to the main screen for fetching the next page and all remaining pages.
  3. Add the "Batch operations..." command to the burger button's menu (top-left corner in standalone/injected) and to the Task Hub's drop-down menu (in vscode). That command should bring up a modal dialog with a single-select dropdown of supported operations, a textbox for extra parameters and a "Start" button. Once triggered, the dialog should apply the chosen operation to all instances currently shown on the main list. Successes/failures should be properly visualized.
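The logic behind steps 2 and 3 might look something like the sketch below. It is not the Durable Functions Monitor implementation: `fetch_all_pages`, `run_batch`, the page-fetching callback, and the operation registry are hypothetical names introduced here, and an empty page is assumed to signal the end of the list.

```python
from typing import Callable, Dict, List, Tuple


def fetch_all_pages(fetch_page: Callable[[int], List[str]]) -> List[str]:
    """Call fetch_page(0), fetch_page(1), ... until an empty page is returned."""
    ids: List[str] = []
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            return ids
        ids.extend(batch)
        page += 1


def run_batch(
    instance_ids: List[str],
    operations: Dict[str, Callable[[str, str], None]],  # name -> operation
    operation_name: str,
    extra_params: str = "",
) -> Tuple[List[str], Dict[str, str]]:
    """Apply one named operation to every listed instance.

    Returns (succeeded_ids, {failed_id: error_text}) so a UI can show both.
    """
    operation = operations[operation_name]
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for instance_id in instance_ids:
        try:
            operation(instance_id, extra_params)
            succeeded.append(instance_id)
        except Exception as exc:
            failed[instance_id] = str(exc)
    return succeeded, failed
```

Here the keys of `operations` play the role of the dropdown entries, `extra_params` stands in for the textbox, and the returned pair is what the dialog would use to visualize successes and failures.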

Any objections to this approach?

No objections. Sounds good 👍