[Bug] If a child workflow is in progress when the service goes down, no workflows are resumed
tgrieger-sf opened this issue
What are you really trying to do?
Validating my understanding of how Temporal handles resuming child workflows.
Describe the bug
When running a simple workflow that spawns child workflows, if a child workflow is in progress and I kill the application, restarting the application does not resume the workflows as I would expect.
If I instead kill the application while a child workflow is NOT running and then restart it, the parent workflow resumes without a problem.
I've linked essentially what I'm doing, boiled down to exactly what is needed to reproduce. What I don't know is whether this is expected no matter what, an issue with the dotnet sdk, or something else. Any help is much appreciated.
Minimal Reproduction
Minimal reproduction with instructions here: https://github.com/tgrieger-sf/TemporalChildWorkflowBug
Environment/Versions
- OS and processor: x64 Windows 10
- Temporal Version: server 1.20.1
- Are you using Docker or Kubernetes or building Temporal from source? Running it from temporal start dev
Additional context
N/A
I suspect your problem is that you have crashed an activity without the server knowing. For all but the most immediate activities, you should set a HeartbeatTimeout and heartbeat regularly as a keep-alive to let the server know the activity hasn't crashed. See https://github.com/temporalio/sdk-dotnet#activity-heartbeating-and-cancellation. You should also set a StartToCloseTimeout to the maximum amount of time you expect an activity attempt to take before it should be retried.
The child workflows resume, but they have nothing to resume because they are waiting on the activity. But since you are not heartbeating your activity or setting a reasonable timeout, the server thinks the activity is still running.
(We can continue to discuss here, or in the forum at https://community.temporal.io/ or in #dotnet-sdk in Slack.)
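For illustration, the heartbeat and timeout advice above might look roughly like the following in the .NET SDK. This is only a sketch, not the code from the linked reproduction: the names LongRunningActivities, DoLongWorkAsync, and ChildWorkflow, the ten one-second steps, and the specific timeout values are all assumptions chosen for the example.

```csharp
using System;
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Workflows;

// Hypothetical activity class; the name and the work it does are illustrative.
public static class LongRunningActivities
{
    [Activity]
    public static async Task<string> DoLongWorkAsync()
    {
        var ctx = ActivityExecutionContext.Current;
        for (var step = 0; step < 10; step++)
        {
            // Simulated unit of work. Passing the activity's cancellation
            // token lets the delay stop if the activity is cancelled.
            await Task.Delay(TimeSpan.FromSeconds(1), ctx.CancellationToken);

            // Tell the server this attempt is still alive. If the worker dies,
            // heartbeats stop and the server can time out the attempt once
            // HeartbeatTimeout elapses, instead of assuming it is still running.
            ctx.Heartbeat(step);
        }
        return "done";
    }
}

// Hypothetical child workflow that waits on the activity.
[Workflow]
public class ChildWorkflow
{
    [WorkflowRun]
    public async Task<string> RunAsync()
    {
        return await Workflow.ExecuteActivityAsync(
            () => LongRunningActivities.DoLongWorkAsync(),
            new ActivityOptions
            {
                // Assumed upper bound for a single attempt of this activity.
                StartToCloseTimeout = TimeSpan.FromMinutes(1),
                // Assumed value; should be comfortably longer than the
                // interval between heartbeats (about one second here).
                HeartbeatTimeout = TimeSpan.FromSeconds(5),
            });
    }
}
```

With options like these, killing the worker mid-activity should lead the server to notice the missing heartbeats after roughly the heartbeat timeout and schedule a retry, which a restarted worker can then pick up.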
I'll close the issue. I updated the ScheduleToCloseTimeout to StartToCloseTimeout and set it to a more reasonable 1 second, and that did it. Thanks!
Note that 1 second may not actually be very reasonable for schedule to close. That means the activity has to be picked up, processed including all retries, and completed within a second. It may work for a test, but your normal workflow may need a bit more leniency.
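To make the distinction concrete: StartToCloseTimeout bounds a single attempt, while ScheduleToCloseTimeout bounds the whole activity from scheduling to completion, including queue time and every retry. A minimal sketch of setting both is below; the ExampleActivityOptions helper and all of the values are hypothetical and would need tuning for a real workload.

```csharp
using System;
using Temporalio.Common;
using Temporalio.Workflows;

// Hypothetical helper; the name and the values are illustrative only.
public static class ExampleActivityOptions
{
    public static ActivityOptions Create() => new()
    {
        // Per-attempt limit: each individual try must finish within this time.
        StartToCloseTimeout = TimeSpan.FromSeconds(30),
        // Overall limit: time waiting in the task queue plus all attempts
        // and retry backoff must fit inside this window.
        ScheduleToCloseTimeout = TimeSpan.FromMinutes(5),
        // Assumed retry cap for the example.
        RetryPolicy = new RetryPolicy { MaximumAttempts = 5 },
    };
}
```

The result of Create() can be passed as the options argument to Workflow.ExecuteActivityAsync. The overall limit is deliberately much larger than the per-attempt limit so that retries have room to run.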