temporalio / sdk-dotnet

Temporal .NET SDK

[Bug] If a child workflow is in progress when the service goes down, no workflows are resumed

tgrieger-sf opened this issue

What are you really trying to do?

Validating my understanding of how Temporal handles resuming child workflows.

Describe the bug

When running a simple workflow that spawns child workflows, if a child workflow is in progress and I kill the application, restarting the application does not resume the workflows as I would expect.

If I instead kill the application while a child workflow is NOT running and then restart it, the parent workflow resumes without a problem.

I've linked essentially what I'm doing, boiled down to exactly what is needed to reproduce. What I don't know is whether this is expected no matter what, an issue with the dotnet sdk, or something else. Any help is much appreciated.

Minimal Reproduction

Minimal reproduction with instructions here: https://github.com/tgrieger-sf/TemporalChildWorkflowBug

Environment/Versions

  • OS and processor: x64 Windows 10
  • Temporal Version: server 1.20.1
  • Are you using Docker or Kubernetes or building Temporal from source? Running it with temporal server start-dev

Additional context

N/A

I suspect your problem is that you have crashed an activity without the server knowing. For all but the most immediate activities, you should set a HeartbeatTimeout and heartbeat regularly as a keep alive to let the server know the activity hasn't crashed. See https://github.com/temporalio/sdk-dotnet#activity-heartbeating-and-cancellation. You should also set a StartToCloseTimeout with the max amount of time you expect an activity attempt to take before it should retry.

The child workflows resume, but they have nothing to resume because they are waiting on the activity. But since you are not heartbeating your activity or setting a reasonable timeout, the server thinks the activity is still running.
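As a rough illustration of the advice above, here is a minimal sketch of an activity that heartbeats, called with both timeouts set. The class names, method names, loop, and durations are hypothetical and not taken from the linked reproduction; it assumes the Temporalio NuGet package.

```csharp
using System;
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Workflows;

// Hypothetical activity class, for illustration only.
public class ExampleActivities
{
    [Activity]
    public static async Task<string> SlowOperationAsync(string input)
    {
        var ctx = ActivityExecutionContext.Current;
        for (var step = 0; step < 10; step++)
        {
            // Tell the server this attempt is still alive. If the worker dies,
            // heartbeats stop and the server can time the attempt out and retry it.
            ctx.Heartbeat(step);

            // Stand-in for real work; honors cancellation of the activity.
            await Task.Delay(TimeSpan.FromSeconds(1), ctx.CancellationToken);
        }
        return $"done: {input}";
    }
}

// Hypothetical child workflow that waits on the activity.
[Workflow]
public class ExampleChildWorkflow
{
    [WorkflowRun]
    public async Task<string> RunAsync(string input)
    {
        return await Workflow.ExecuteActivityAsync(
            () => ExampleActivities.SlowOperationAsync(input),
            new ActivityOptions
            {
                // Max time a single attempt may take before it is retried (assumed value).
                StartToCloseTimeout = TimeSpan.FromMinutes(1),
                // Max gap between heartbeats before the attempt is considered dead (assumed value).
                HeartbeatTimeout = TimeSpan.FromSeconds(5),
            });
    }
}
```

With something like this, killing the worker mid-activity should lead to a heartbeat timeout, after which the server schedules a retry that a restarted worker can pick up.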

(can continue to discuss here, or can also discuss in forum at https://community.temporal.io/ or #dotnet-sdk in Slack)

I'll close the issue. I changed the ScheduleToCloseTimeout to a StartToCloseTimeout, set it to a more reasonable 1 second, and that did it. Thanks!

Note that 1 second may not actually be very reasonable for schedule to close. That means the activity has to be picked up, processed including all retries, and completed within a second. It may work for a test, but your normal workflow may need a bit more leniency.
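To make the distinction concrete, here is a small sketch of activity options that set the per-attempt timeout and the overall timeout separately. The helper class name and all durations are assumptions chosen for illustration, not recommended values.

```csharp
using System;
using Temporalio.Workflows;

// Hypothetical helper, for illustration only.
public static class ExampleActivityOptions
{
    public static ActivityOptions Create() => new()
    {
        // Applies to each attempt on its own: one attempt may run this long.
        StartToCloseTimeout = TimeSpan.FromSeconds(30),
        // Applies to the whole activity: queue time plus every attempt and retry.
        ScheduleToCloseTimeout = TimeSpan.FromMinutes(5),
        // Max gap between heartbeats before an attempt is considered dead.
        HeartbeatTimeout = TimeSpan.FromSeconds(5),
    };
}
```

Keeping the overall timeout well above the per-attempt timeout leaves room for retries after a worker crash.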