[Bug] If a child workflow is in progress when the service goes down, no workflows are resumed

tgrieger-sf opened this issue a year ago

What are you really trying to do?

Validating my understanding of how Temporal handles resuming child workflows.

Describe the bug

When running a simple workflow that spawns child workflows, if a child workflow is in progress and I kill the application, restarting the application does not resume the workflows as I would expect.

If I, instead, kill the application while a child workflow is NOT running and then resume it, the parent workflow resumes no problem.

I've linked essentially what I'm doing, boiled down to exactly what is needed to reproduce. What I don't know is whether this is expected no matter what, an issue with the .NET SDK, or something else. Any help is much appreciated.

Minimal Reproduction

Minimal reproduction with instructions here: https://github.com/tgrieger-sf/TemporalChildWorkflowBug

Environment/Versions

OS and processor: x64 Windows 10
Temporal Version: server 1.20.1
Are you using Docker or Kubernetes or building Temporal from source? Running it from temporal start dev

Additional context

N/A

Chad Retz · Answer 1 · Thu Jun 08 2023 03:37:01 GMT+0800 (China Standard Time)

I suspect your problem is that you have crashed an activity without the server knowing. For all but the most immediate activities, you should set a HeartbeatTimeout and heartbeat regularly as a keep alive to let the server know the activity hasn't crashed. See https://github.com/temporalio/sdk-dotnet#activity-heartbeating-and-cancellation. You should also set a StartToCloseTimeout with the max amount of time you expect an activity attempt to take before it should retry.

The child workflows resume, but they have nothing to resume because they are waiting on the activity. But since you are not heartbeating your activity or setting a reasonable timeout, the server thinks the activity is still running.
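To make the reasoning above concrete, here is a rough sketch in plain Python. It is a simplified timing model, not Temporal's server logic or SDK API: the function name `attempt_timeout_at` and its parameters are hypothetical, and the numbers are illustrative. It only shows why a crashed activity attempt goes unnoticed until whichever configured deadline passes first.

```python
from typing import Optional


def attempt_timeout_at(
    started_at: float,
    last_heartbeat_at: Optional[float],
    start_to_close: Optional[float] = None,
    heartbeat_timeout: Optional[float] = None,
) -> Optional[float]:
    """Return the earliest time (in seconds) at which a silent activity
    attempt would be treated as timed out, or None if no per-attempt
    deadline applies. Hypothetical helper for illustration only."""
    deadlines = []
    if start_to_close is not None:
        # The attempt as a whole may not run longer than start_to_close.
        deadlines.append(started_at + start_to_close)
    if heartbeat_timeout is not None:
        # Silence is measured from the last heartbeat, or from the start
        # of the attempt if no heartbeat was ever recorded.
        last_seen = started_at if last_heartbeat_at is None else last_heartbeat_at
        deadlines.append(last_seen + heartbeat_timeout)
    return min(deadlines) if deadlines else None


if __name__ == "__main__":
    # Worker dies shortly after a heartbeat at t=10s.
    with_heartbeat = attempt_timeout_at(
        0.0, 10.0, start_to_close=300.0, heartbeat_timeout=5.0
    )
    without_heartbeat = attempt_timeout_at(0.0, None, start_to_close=300.0)
    print(with_heartbeat, without_heartbeat)
```

In this model, heartbeating lets the crash be noticed a few seconds after the last heartbeat, while relying on a long timeout alone means waiting out the whole timeout before a retry can be scheduled.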

(can continue to discuss here, or can also discuss in forum at https://community.temporal.io/ or #dotnet-sdk in Slack)

Trevor Grieger · Answer 2 · Thu Jun 08 2023 04:01:17 GMT+0800 (China Standard Time)

I'll close the issue. I changed the ScheduleToCloseTimeout to a StartToCloseTimeout, set it to a more reasonable 1 second, and that did it. Thanks!

Chad Retz · Answer 3 · Thu Jun 08 2023 06:22:53 GMT+0800 (China Standard Time)

Note that 1 second may not actually be very reasonable for schedule-to-close. That means the activity has to be picked up, processed (including all retries), and completed within a second. That may work for a test, but your normal workflow may need a bit more leniency.
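As a rough illustration of the difference between the two timeouts discussed in this thread, the sketch below models them in plain Python. It assumes a schedule-to-close budget covers queue wait, every attempt, and the backoff between retries, while a start-to-close budget applies to each attempt separately. The function names, the fixed backoff, and all numbers are hypothetical, and this is not Temporal's implementation.

```python
from typing import List


def total_elapsed_ms(queue_wait_ms: int, attempt_ms: List[int], backoff_ms: int) -> int:
    """Time from scheduling to final completion: queue wait, plus every
    attempt, plus a fixed (assumed) backoff before each retry."""
    retries = max(len(attempt_ms) - 1, 0)
    return queue_wait_ms + sum(attempt_ms) + retries * backoff_ms


def within_schedule_to_close(
    queue_wait_ms: int, attempt_ms: List[int], backoff_ms: int, limit_ms: int
) -> bool:
    # One budget for the whole activity, retries included.
    return total_elapsed_ms(queue_wait_ms, attempt_ms, backoff_ms) <= limit_ms


def within_start_to_close(attempt_ms: List[int], limit_ms: int) -> bool:
    # A separate budget for each individual attempt.
    return all(duration <= limit_ms for duration in attempt_ms)


if __name__ == "__main__":
    attempts = [400, 400]  # first attempt fails, second succeeds
    print(within_start_to_close(attempts, 1000))
    print(within_schedule_to_close(200, attempts, 1000, 1000))
```

With these illustrative numbers, each attempt fits a 1 second per-attempt limit, but the activity as a whole takes 2 seconds once queue wait and one retry are counted, so a 1 second overall limit would be exceeded.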