dapr / dapr

Dapr is a portable, event-driven runtime for building distributed applications across cloud and edge.

Home Page: https://dapr.io


[Proposal] Workflow building block and engine

johnewart opened this issue

In what area(s)?

/area runtime

What?

This document proposes that the Dapr runtime be extended to include a new workflow building block. This building block, in combination with a lightweight, portable, workflow engine will enable developers to express workflows as code that can be executed, interacted with, monitored, and debugged using the Dapr runtime.

Why?

Many complex business processes are well modeled as a workflow - a set of steps needing to be orchestrated that require resiliency and guarantee completion (success or failure are both completions). To build such workflows, developers are often faced with needing to solve a host of complex problems, including (but not limited to):

  • Scheduling
  • Lifecycle management
  • State storage
  • Monitoring and debugging
  • Resiliency
  • Failure handling mechanisms

Based on available data, it is clear that workflows are quite popular; at the time of this writing, hosted workflows and their tasks are executed billions of times per day across tens of thousands of Azure subscriptions.

What is a workflow?

A workflow, for the purpose of this proposal, is defined as application logic that defines a business process or data flow that:

  • Has a specific, pre-defined, deterministic lifecycle (e.g., Pending -> Running -> [Completed | Failed | Terminated])
  • Is guaranteed to complete
  • Is durable (i.e., completes in the face of transient errors)
  • Can be scheduled to start or execute steps at or after some future time
  • Can be paused and resumed (explicitly or implicitly)
  • Can execute portions of the workflow in serial or parallel
  • Can be directly addressed by external agents (i.e., an instance of the workflow can be interacted with directly - paused, resumed, queried, etc.)
  • May be versioned
  • May be stateful
  • May create new sub-workflows and optionally wait for those to complete before progressing
  • May rely on external components to perform its job (e.g., HTTPS API calls, pub/sub message queues, etc.)
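The deterministic lifecycle in the first bullet can be sketched as a small transition table. The status names come from the example above; the specific transition rules are an assumption for illustration, not part of the proposal:

```typescript
// Sketch of the deterministic workflow lifecycle described above.
// Status names come from the proposal's example; the transition
// table itself is an illustrative assumption.
type WorkflowStatus = "Pending" | "Running" | "Completed" | "Failed" | "Terminated";

const transitions: Record<WorkflowStatus, WorkflowStatus[]> = {
  Pending: ["Running", "Terminated"],
  Running: ["Completed", "Failed", "Terminated"],
  Completed: [], // terminal: success and failure both count as completion
  Failed: [],
  Terminated: [],
};

function canTransition(from: WorkflowStatus, to: WorkflowStatus): boolean {
  return transitions[from].includes(to);
}

function isTerminal(status: WorkflowStatus): boolean {
  return transitions[status].length === 0;
}
```

Because every path through the table ends in a state with no outgoing transitions, "guaranteed to complete" means guaranteed to reach one of the terminal states.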

Why Dapr?

Dapr already contains many of the building blocks required to provide reliability, scalability, and durability to the execution of workflows. Building such an engine inside Dapr and providing the necessary building blocks will increase developer productivity through the reuse of existing features, and independence from the underlying execution mechanism will increase portability.

In addition to the built-in execution engine, Dapr can provide a consistent programming interface for interacting with third-party workflow execution systems (e.g., AWS SWF, Apache Camel, Drools) for those who are already using these tools. This provides a standardized interface for working with both external workflows and those running inside Dapr.

Proposal

High-level overview of changes

We propose that the following features / capabilities be added to the Dapr runtime:

  • A new "workflow" building block
  • A portable, lightweight workflow engine embedded into the Dapr sidecar capable of supporting long-running, resilient, and durable workflows through Dapr's building blocks
  • An expressive, developer-friendly programming model for building workflows as code
  • Support for containerized, declarative workflows (such as the CNCF Serverless Workflow specification)
  • Extensions to the Dapr dashboard for monitoring / managing workflow execution
  • APIs for interacting with workflows

The Workflow building block

As mentioned before, this proposal includes the addition of a new workflow building block. Like most of the other Dapr building blocks (state stores, pubsub, etc.) the workflow building block will consist of two primary things:

  • A pluggable component model for integrating various workflow engines
  • A set of APIs for managing workflows (start, schedule, pause, resume, cancel)

Similar to the built-in support for actors, we also propose implementing a built-in runtime for workflows (see the DTFx-go engine described in the next section). Unlike actors, the workflow runtime component can be swapped out with an alternate implementation. If developers want to work with other workflow engines, such as externally hosted workflow services like Azure Logic Apps, AWS Step Functions, or Temporal.io, they can do so with alternate community-contributed workflow components.

The value of this building block for vendors is that workflows supported by their platforms can be exposed as APIs with support for HTTP and the Dapr SDKs. The less visible benefits of mTLS, distributed tracing, etc. will also be available. Various abstractions, such as async HTTP polling, can also be supported via Dapr without the workflow vendor needing to implement them themselves.

Introducing DTFx-go

We propose adding a lightweight, portable, embedded workflow engine (DTFx-go) to the Dapr sidecar that leverages existing Dapr components, including actors and state storage, in its underlying implementation. Because it is lightweight and portable, developers will be able to execute workflows that run inside DTFx-go locally as well as in production with minimal overhead; this enhances the developer experience by integrating workflows with the existing Dapr development model that users enjoy.

The new engine will be written in Go and inspired by the existing Durable Task Framework (DTFx) engine. We’ll call this new version of the framework DTFx-go to distinguish it from the .NET implementation (which is not part of this proposal) and it will exist as an open-source project with a permissive, e.g., Apache 2.0, license so that it remains compatible as a dependency for CNCF projects. Note that it’s important to ensure this engine remains lightweight so as not to noticeably increase the size of the Dapr sidecar.

Importantly, DTFx-go will not be exposed to the application layer. Rather, the Dapr sidecar will expose DTFx-go functionality over a gRPC stream. The Dapr sidecar will not execute any app-specific workflow logic or load any declarative workflow documents. Instead, app containers will be responsible for hosting the actual workflow logic. The Dapr sidecar can send and receive workflow commands over gRPC to and from the connected app's workflow logic, and it can execute commands on behalf of the workflow (service invocation, invoking bindings, etc.). Other concerns such as activation, scale-out, and state persistence will be handled by internally managed actors. More details on all of this will be discussed in subsequent sections.

Execution, scheduling and resilience

Internally, Dapr workflow instances will be implemented as actors. Actors drive workflow execution by communicating with the workflow SDK over a gRPC stream. By using actors, the problems of placement and scalability are already solved for us.

(placement diagram)

The execution of individual workflows will be triggered using actor reminders as they are both persistent and durable (two critical features of workflows). If a container or node crashes during a workflow’s execution, the actor’s reminder will ensure it gets activated again and resumes where it left off (using state storage to provide durability, see below).

To prevent a workflow from unintentionally blocking, each workflow will be composed of two separate actor components: one acting as the scheduler/coordinator and the other performing the actual work (calling API services, performing computation, etc.).

(execution diagram)

Storage of state and durability

For a workflow execution to reliably complete in the face of transient errors, it must be durable, meaning that it is able to store data at checkpoints as it makes progress. To achieve this, workflow executions will rely on Dapr's state storage to provide stable storage such that the workflow can be safely resumed from a known state in the event that it is explicitly paused or a step is prematurely terminated (system failure, lack of resources, etc.).

Workflows as code

The term "workflow as code" refers to the implementation of a workflow’s logic using general purpose programming languages. "Workflow as code" is used in a growing number of modern workflow frameworks, such as Azure Durable Functions, Temporal.io, and Prefect (Orion). The advantage of this approach is its developer-friendliness. Developers can use a programming language that they already know (no need to learn a new DSL or YAML schema), they have access to the language’s standard libraries, can build their own libraries and abstractions, can use debuggers and examine local variables, and can even write unit tests for their workflows just like they would any other part of their application logic.

The Dapr SDK will internally communicate with the DTFx-go gRPC endpoint in the Dapr sidecar to receive new workflow events and send new workflow commands, but these protocol details will be hidden from the developer. Due to the complexities of the workflow protocol, we are not proposing any HTTP API for the runtime aspect of this feature.

Support for declarative workflows

We expect workflows as code to be very popular for developers because working with code is both very natural for developers and is much more expressive and flexible compared to declarative workflow modeling languages. In spite of this, there will still be users who will prefer or require workflows to be declarative. To support this, we propose building an experience for declarative workflows as a layer on top of the "workflow as code" foundation. A variety of declarative workflows could be supported in this way. For example, this model could be used to support the AWS Step Functions workflow syntax, the Azure Logic Apps workflow syntax, or even the Google Cloud Workflow syntax. However, for the purpose of this proposal, we’ll focus on what it would look like to support the CNCF Serverless Workflow specification. Note, however, that the proposed model could be used to support any number of declarative workflow schemas.

CNCF Serverless Workflows

Serverless Workflow (SLWF) consists of an open-source standards-based DSL and dev tools for authoring and validating workflows in either JSON or YAML. SLWF was specifically selected for this proposal because it represents a cloud native and industry standard way to author workflows. There are a set of already existing open-source tools for generating and validating these workflows that can be adopted by the community. It’s also an ideal fit for Dapr since it’s under the CNCF umbrella (currently as a sandbox project). This proposal would support the SLWF project by providing it with a lightweight, portable runtime – i.e., the Dapr sidecar.

Hosting Serverless Workflows

In this proposal, we use the Dapr SDKs to build a new, portable SLWF runtime that leverages the Dapr sidecar. It would most likely be implemented as a reusable container image and would support loading workflow definition files from Dapr state stores (the exact details need to be worked out). Note that the Dapr sidecar doesn’t load any workflow definitions. Rather, the sidecar simply drives the execution of the workflows, leaving all other details to the application layer.

API

Start Workflow API

HTTP / gRPC

Developers can start workflow instances by issuing an HTTP (or gRPC) API call to the Dapr sidecar:

POST http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}/start

Workflows are assumed to have a type that is identified by the {workflowType} parameter. Each workflow instance must also be created with a unique {instanceId} value. The payload of the request is the input of the workflow. If a workflow instance with this ID already exists, this call will fail with an HTTP 409 Conflict.

To support the asynchronous HTTP polling pattern used by HTTP clients, this API will return an HTTP 202 Accepted response with a Location header containing a URL that can be used to get the status of the workflow (see further below). When the workflow completes, this endpoint will return an HTTP 200 response. If it fails, the endpoint can return a 4XX or 5XX HTTP error response code. Some of these details may need to be configurable since there is no universal protocol for async API handling.
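A client-side sketch of this contract, with the route shape taken from the proposal above; the helper names and the outcome mapping are hypothetical:

```typescript
// Hypothetical client helpers for the proposed start-workflow route.
const DAPR_BASE = "http://localhost:3500/v1.0";

function startWorkflowUrl(workflowType: string, instanceId: string): string {
  return `${DAPR_BASE}/workflows/${encodeURIComponent(workflowType)}/${encodeURIComponent(instanceId)}/start`;
}

// Map the response codes described above onto client-side outcomes.
function interpretStartResponse(status: number): "accepted" | "completed" | "conflict" | "error" {
  if (status === 202) return "accepted"; // poll the Location header for status
  if (status === 200) return "completed"; // workflow ran to completion
  if (status === 409) return "conflict"; // an instance with this ID already exists
  return "error"; // other 4XX / 5XX responses
}
```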

Input bindings

For certain types of automation scenarios, it can be useful to trigger new instances of workflows directly from Dapr input bindings. For example, it may be useful to trigger a workflow in response to a tweet from a particular user account using the Twitter input binding. Another example is starting a new workflow in response to a Kubernetes event, like a deployment creation event.

The instance ID and input payload for the workflow depends on the configuration of the input binding. For example, a user may want to use a Tweet’s unique ID or the name of the Kubernetes deployment as the instance ID.

Pub/Sub

Workflows can also be started directly from pub/sub events, similar to the proposal for Actor pub/sub. Configuration on the pub/sub topic can be used to identify an appropriate instance ID and input payload to use for initializing the workflow. In the simplest case, the source + ID of the cloud event message can be used as the workflow’s instance ID.
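Deriving the instance ID from a cloud event's source and ID, as suggested above, might look like this sketch. The field names follow the CloudEvents attributes; the joining scheme is an assumption:

```typescript
// Sketch: derive a workflow instance ID from a pub/sub cloud event using
// source + id, as the simplest case described above. Field names follow the
// CloudEvents spec; the joining scheme is an illustrative assumption.
interface CloudEvent {
  source: string;
  id: string;
  data?: unknown;
}

function workflowInstanceId(event: CloudEvent): string {
  return `${event.source}-${event.id}`;
}
```

Because source + id uniquely identifies a cloud event, redelivery of the same event maps to the same instance ID, so duplicate deliveries cannot start duplicate workflows.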

Terminate workflow API

HTTP / gRPC

Workflow instances can also be terminated using an explicit API call.

POST http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}/terminate

Workflow termination is primarily an operation that a service operator takes if a particular business process needs to be cancelled, or if a problem with the workflow requires it to be stopped to mitigate impact to other services.

If a payload is included in the POST request, it will be saved as the output of the workflow instance.

Raise Event API

Workflows are especially useful when they can wait for and be driven by external events. For example, a workflow could subscribe to events from a pubsub topic as shown in the Phone Verification sample. However, this capability shouldn’t be limited to pub/sub events.

HTTP / gRPC

An API should exist for publishing events directly to a workflow instance:

POST http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}/raiseEvent

The result of the "raise event" API is an HTTP 202 Accepted, indicating that the event was received but possibly not yet processed. A workflow can consume an external event using the waitForExternalEvent SDK method.
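The pairing between the raise-event API and the waitForExternalEvent SDK method can be sketched as a simple buffered channel (all names here are hypothetical, not the actual SDK surface): events raised before any step is waiting are queued, and a waiting step is resolved as soon as a matching event arrives.

```typescript
// Sketch of raise-event / waitForExternalEvent pairing (names hypothetical).
// Events raised early are buffered; waiters are resolved on arrival.
class EventChannel {
  private pending = new Map<string, string[]>();
  private waiters = new Map<string, ((payload: string) => void)[]>();

  raiseEvent(name: string, payload: string): void {
    const waiter = this.waiters.get(name)?.shift();
    if (waiter) {
      waiter(payload); // someone is already waiting: deliver immediately
      return;
    }
    const queue = this.pending.get(name) ?? [];
    queue.push(payload); // nobody waiting yet: buffer until asked for
    this.pending.set(name, queue);
  }

  waitForExternalEvent(name: string): Promise<string> {
    const queued = this.pending.get(name)?.shift();
    if (queued !== undefined) return Promise.resolve(queued);
    return new Promise(resolve => {
      const list = this.waiters.get(name) ?? [];
      list.push(resolve);
      this.waiters.set(name, list);
    });
  }
}
```

This is consistent with the 202 Accepted response: the API only guarantees the event was received, not that a workflow step has consumed it yet.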

Get workflow metadata API

HTTP / gRPC

Users can fetch the metadata of a workflow instance using an explicit API call.

GET http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}

The result of this call is workflow instance metadata, such as its start time, runtime status, completion time (if completed), and custom or runtime-specific status. If supported by the target runtime, workflow inputs and outputs can also be fetched using the query API.

Purge workflow metadata API

Users can delete all state associated with a workflow using the following API:

DELETE http://localhost:3500/v1.0/workflows/{workflowType}/{instanceId}

When using the embedded workflow component, this will delete all state stored by the workflow’s underlying actor(s).

Footnotes and Examples

Example 1: Bank transaction

In this example, the workflow is implemented as a JavaScript generator function. The "bank1" and "bank2" parameters are microservice apps that use Dapr, each of which exposes "withdraw" and "deposit" APIs. The Dapr APIs available to the workflow come from the context parameter object and return a "task", which is effectively the same as a Promise. Calling yield on the task causes the workflow to durably checkpoint its progress and wait until Dapr responds with the output of the service method. The value of the task is the service invocation result. If any service method call fails with an error, the error is surfaced as a raised JavaScript error that can be caught using normal try/catch syntax. This code can also be debugged using a Node.js debugger.

Note that the details around how code is written will vary depending on the language. For example, a C# SDK would allow developers to use async/await instead of yield. Regardless of the language details, the core capabilities will be the same across all languages.

import { DaprWorkflowClient, DaprWorkflowContext, HttpMethod } from "dapr-client"; 

const daprHost = process.env.DAPR_HOST || "127.0.0.1"; // Dapr sidecar host 
const daprPort = process.env.DAPR_WF_PORT || "50001"; // Dapr sidecar port for workflow 
const workflowClient = new DaprWorkflowClient(daprHost, daprPort); 

// Funds transfer workflow which receives a context object from Dapr and an input 
workflowClient.addWorkflow('transfer-funds-workflow', function*(context: DaprWorkflowContext, op: any) { 
    // use built-in methods for generating pseudo-random data in a workflow-safe way 
    const transactionId = context.createV5uuid(); 

    // try to withdraw funds from the source account. 
    const success = yield context.invoker.invoke("bank1", "withdraw", HttpMethod.POST, { 
        srcAccount: op.srcAccount, 
        amount: op.amount, 
        transactionId 
    }); 

    if (!success.success) { 
        return "Insufficient funds"; 
    } 

    try { 
        // attempt to deposit into the dest account, which is part of a separate microservice app 
        yield context.invoker.invoke("bank2", "deposit", HttpMethod.POST, {
            destAccount: op.destAccount, 
            amount: op.amount, 
            transactionId 
        }); 
        return "success"; 
    } catch { 
        // compensate for failures by returning the funds to the original account 
        yield context.invoker.invoke("bank1", "deposit", HttpMethod.POST, { 
            destAccount: op.srcAccount, 
            amount: op.amount, 
            transactionId 
        }); 
        return "failure"; 
    } 
}); 

// Call start() to start processing workflow events 
workflowClient.start(); 

Example 2: Phone Verification

Here’s another sample that shows how a developer might build an SMS phone verification workflow. The workflow receives a user’s phone number, creates a challenge code, delivers the challenge code to the user’s SMS number, and waits for the user to respond with the correct challenge code.

The important takeaway is that the end-to-end workflow can be represented as a single, easy-to-understand function. Rather than relying directly on actors to hold state explicitly, state (such as the challenge code) can simply be stored in local variables, drastically reducing the overall code complexity and making the solution easily unit testable.

import { DaprWorkflowClient, DaprWorkflowContext, HttpMethod } from "dapr-client"; 

const daprHost = process.env.DAPR_HOST || "127.0.0.1"; // Dapr sidecar host 
const daprPort = process.env.DAPR_WF_PORT || "50001"; // Dapr sidecar port for workflow 
const workflowClient = new DaprWorkflowClient(daprHost, daprPort); 

// Phone number verification workflow which receives a context object from Dapr and an input 
workflowClient.addWorkflow('phone-verification', function*(context: DaprWorkflowContext, phoneNumber: string) { 

    // Create a challenge code and send a notification to the user's phone 
    const challengeCode = yield context.invoker.invoke("authService", "createSmsChallenge", HttpMethod.POST, { 
        phoneNumber 
    }); 

    // Schedule a durable timer for some future date (e.g. 5 minutes or perhaps even 24 hours from now) 
    const expirationTimer = context.createTimer(challengeCode.expiration); 

    // The user gets three tries to respond with the right challenge code 
    let authenticated = false; 

    for (let i = 0; i < 3; i++) { 
        // subscribe to the event representing the user challenge response 
        const responseTask = context.pubsub.subscribeOnce("my-pubsub-component", "sms-challenge-topic"); 

        // block the workflow until either the timeout expires or we get a response event 
        const winner = yield context.whenAny([expirationTimer, responseTask]); 

        if (winner === expirationTimer) { 
            break; // timeout expired 
        } 

        // we get a pubsub event with the user's SMS challenge response 
        if (responseTask.result.data.challengeNumber === challengeCode.number) { 
            authenticated = true; // challenge verified! 
            expirationTimer.cancel(); 
            break; 
        } 
    } 

    // the return value is available as part of the workflow status. Alternatively, we could send a notification. 
    return authenticated; 
}); 

// Call listen() to start processing workflow events 
workflowClient.listen(); 

Example 3: Declarative workflow for monitoring patient vitals

The following is an example of a very simple SLWF workflow definition that listens on three different event types and invokes a function depending on which event was received.

{ 
    "id": "monitorPatientVitalsWorkflow", 
    "version": "1.0", 
    "name": "Monitor Patient Vitals Workflow", 
    "states": [ 
      { 
        "name": "Monitor Vitals", 
        "type": "event", 
        "onEvents": [ 
          { 
            "eventRefs": [ 
              "High Body Temp Event", 
              "High Blood Pressure Event" 
            ], 
            "actions": [{"functionRef": "Invoke Dispatch Nurse Function"}] 
          }, 
          { 
            "eventRefs": ["High Respiration Rate Event"], 
            "actions": [{"functionRef": "Invoke Dispatch Pulmonologist Function"}] 
          } 
        ], 
        "end": true 
      } 
    ], 
    "functions": "file://my/services/asyncapipatientservicedefs.json", 
    "events": "file://my/events/patientcloudeventsdefs.yml" 
} 

The functions defined in this workflow would map to Dapr service invocation calls. Similarly, the events would map to incoming Dapr pub/sub events. Behind the scenes, the runtime (which is built using the Dapr SDK APIs mentioned previously) handles the communication with the Dapr sidecar, which in turn manages the checkpointing of state and recovery semantics for the workflows.

(Fixed the placement image; there were previously two duplicate Actor entities.)

Workflows are one of my favorite areas for actors so I'm happy to see them here :) Quick question though.

Can you describe more on how you're planning on using reminders? Are they just to let users start a workflow on a given period? Or, are we going to be registering reminders for every workflow step and then deleting them when they are done?

If it's the latter, that could end up being a lot of reminders. If that's the case we should call out that we'll use reminder partitioning as I think it'd be likely that we'd hit the tipping point for reminder scaling (which does vary based on the underlying statestore).

Looking forward to this! Initial Qs...

The description of DTFx-go reads "lightweight, portable, embedded workflow engine (DTFx-go) in the Dapr sidecar"

  • Does this mean that you get a workflow engine out-of-the-box, by default, in any environment that is running Dapr? So I don't have to select SLWF, or temporal.io etc?

  • Do I have to specify a state-management component to provide the state store behind the OOTB workflow engine (DTFx-go) ? i.e. With Dapr actors, I must define a state-management component to back the actor state.

Other thoughts

  • Is it wise to call this implementation DTFx-go? I'm specifically talking about the DTFx bit. I ask because DTFx brings a bunch of baggage/concepts/knowledge, such as TaskHub and the various storage providers for the TaskHub. Could this introduce confusion, as customers may expect a degree of interoperability with DTFx? For example, I had the initial thought that this might be compatible with the DF Monitor utility - but it won't be, as I don't recognise there being a TaskHub concept in the DTFx-go proposal thus far.
  • Assuming DF Monitor is NOT compatible, it might be worth lifting some of the concepts from DF Monitor and hosting similar concepts in the Dapr Dashboard to help with the observability and management of workflows.

Can you describe more on how you're planning on using reminders? Are they just to let users start a workflow on a given period? Or, are we going to be registering reminders for every workflow step and then deleting them when they are done?

If it's the latter, that could end up being a lot of reminders. If that's the case we should call out that we'll use reminder partitioning as I think it'd be likely that we'd hit the tipping point for reminder scaling (which does vary based on the underlying statestore).

The exact details are still being ironed out, but it's essentially a variation of the latter - i.e., one active reminder per workflow instance (though not necessarily per action). I agree that leveraging the reminder partitioning work is the right way to ensure this remains scalable. We need to do a bit more research here to figure out the details.

Does this mean that you get a workflow engine out-of-the-box, by default, in any environment that is running Dapr? So I don't have to select SLWF, or temporal.io etc?

Correct - the embedded engine will be the "out of the box" option for anyone that doesn't want to install additional infrastructure into their cluster. We want external workflow services to be supported by the building block (we expect many will prefer to use workflow systems that they already know and love), but not required.

Do I have to specify a state-management component to provide the state store behind the OOTB workflow engine (DTFx-go) ? i.e. With Dapr actors, I must define a state-management component to back the actor state.

Yes, this is essentially a programming model that sits on top of actors, so you'll still need to configure a state store that supports actors.

Is it wise to call this implementation DTFx-go? I'm specifically talking about the DTFx bit. I ask because DTFx brings a bunch of baggage/concepts/knowledge. Such as TaskHub, and the various storage Providers for the TaskHub. Could this introduce confusion, as customers may expect a degree of interoperability with DTFx?! For example.

I'm not too worried about confusion because DTFx-go is just an implementation detail that most users won't know or care about. Users of Dapr will simply be presented with "Dapr Workflow" as a concept and we wouldn't necessarily expose the same extensibility or tooling. The existing DTFx isn't super well-known outside of Azure Functions or internal Microsoft circles, so I'm assuming there won't be a lot of opportunity for confusion even for folks who care to look at the implementation details.

FWIW, the DTFx-go backend storage provider will be one built specifically for storing state and load balancing via the Dapr Actors infrastructure.

Assuming DF Monitor is NOT compatible, it might be worth lifting some of the concepts from DF Monitor and hosting similar concepts in the Dapr Dashboard to help with the observability and management of workflows.

Yes, integration with the Dapr Dashboard is definitely part of the plan.

@cgillum got it thanks.

Suggestion: It might be worth updating the section Storage of state and durability to be a little more explicit that an “Actor compatible” Dapr state store component is required in order to light up the embedded engine.

  • Any indication if workflows must be deterministic, due to replay semantics?

Very interesting proposal... look forward to seeing it progress!

I wonder about the feasibility of a pluggable execution layer... certainly you can define contracts for required integration points (surfacing metadata, lifecycle management, state hydration, etc.) and implement, say, SLWF on top of that from scratch.

How would you envision that working for an existing runtime like AWS Step Functions or Temporal.io, which aren't necessarily built to plug into those abstractions, or in general to be driven "from the outside"?

For this to work, it seems like you would need a "workflow internals spec" to define the required integration points... and then need vendors to implement it. Or do I misunderstand?

Any indication if workflows must be deterministic, due to replay semantics?

Yes, thanks @olitomlinson for calling this out. @johnewart I think we need to update the description above to reflect this important coding constraint.
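To illustrate the constraint: because workflows re-execute from history, values that differ between executions (current time, random numbers) must be obtained through the workflow context so they are checkpointed on first execution and replayed afterwards. A minimal sketch, with all names hypothetical:

```typescript
// Sketch of why workflow code must be deterministic under replay.
// Non-deterministic values go through the context, which records them on
// first execution and replays the recorded value on every subsequent replay.
class ReplayContext {
  constructor(private history: number[] = [], private index = 0) {}

  // e.g., a workflow-safe "current time": recorded once, replayed thereafter.
  currentTimestamp(now: () => number): number {
    if (this.index < this.history.length) {
      return this.history[this.index++]; // replay the checkpointed value
    }
    const value = now(); // first execution: capture the real value
    this.history.push(value);
    this.index++;
    return value;
  }

  getHistory(): number[] {
    return [...this.history];
  }
}
```

Calling Date.now() directly inside workflow code would return a different value on each replay and cause the replay to diverge from history; routing it through the context keeps every replay identical.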

For this to work, it seems like you would need a "workflow internals spec" to define the required integration points... and then need vendors to implement it. Or do I misunderstand?

@jplane I wonder if there might be a slight misunderstanding. We're not proposing that the internal execution engine should support plugging in existing WF runtimes like Step Functions or Temporal.io. That would be a really hard problem to solve, as you suggested, and may not be in everyone's best interest. Rather, we're proposing two specific stories for how other workflow languages and/or runtimes can be pulled in:

  1. The WF building block contract (i.e. the HTTP APIs mentioned) can be used to interface with externally hosted workflow services, similar to how the state stores and pubsub building blocks work. In this case, the built-in engine isn't used at all.
  2. Things like SLWF (or other declarative workflow languages, including the AWS Step Functions spec - i.e. the Amazon States Language) can be supported by implementing a new runtime layer on top of the built-in engine's programming model. In this case, we're using the Dapr workflow engine and not any other existing engine.

The latter point (2) isn't strictly "pluggable extensibility" per se, but more of a model for how developers could contribute their own declarative workflow runtimes that internally rely on the Dapr Workflow built-in engine. It's very similar to the POC SLWF prototype you and I built some time back on top of Durable Functions - the existing Durable engine was used to implement scheduling, durability, etc., and a layer on top was built to interpret the SLWF markup and interface with the Durable APIs.

I hope that makes sense. I can try to clarify further if it's still confusing.

Thx @cgillum for the explanation... that makes sense now.

I really like the direction... be ruthlessly sparse with the engine programming model, and let a thousand higher-order models bloom on top of it. Nice!

Not 100% sold on the BYOruntime sales pitch just yet... the Dapr WF engine's semantics will become a pseudo-standard, so anything that does state management, work scheduling, etc. differently may not be seamlessly swappable behind the building block contract (or, at least, might violate The Principle of Least Surprise for the unsuspecting caller). I see the appeal of swappable runtimes... just wondering about real-world challenges, too.

the Dapr WF engine's semantics will become a pseudo-standard

The goal is to make the Dapr WF APIs (and all Dapr APIs in general) a standard via ongoing work in our API-spec special interest group.

so anything that does state management, work scheduling, etc. differently may not be seamlessly swappable behind the building block contract (or, at least, might violate The Principle of Least Surprise for the unsuspecting caller). I see the appeal of swappable runtimes... just wondering about real-world challenges, too.

I agree with what you're saying here, but I actually think it's valid. In this case, we certainly want to encourage users to choose the default, tested, and optimized path of least resistance, yet open the door for other runtimes if there are special considerations to be made.

Excited to finally see this being discussed! Which means I get to learn more about it myself :-)

I'm a bit unsure about the capabilities of the "worker" actor. Is it correct to say that all context.invoker.invoke commands in the sample code translate to an operation in the worker? Is it also safe to say the worker actor exists simply to provide a framework-compatible means of performing I/O, which is otherwise not allowed directly in the user code? Thanks!

Is it correct to say that all context.invoker.invoke commands in the sample code translate to an operation in the worker?

Behind the scenes, yes. This actor is designed to do any work that may take an indeterminate amount of time to complete, like service invocation. This frees up the scheduler actor to do other work, like respond to queries.

Is it also safe to say the worker actor exists simply to provide a framework-compatible means of performing I/O, which is otherwise not allowed directly in the user code?

Not necessarily. Technically, the scheduler actor could do all the I/O on behalf of the workflow code. The worker actor is really only for potentially long-running I/O, to keep the scheduler actor from getting blocked for too long (actors are single threaded). We may have the scheduler actor do other types of I/O directly, like publishing pub/sub messages.
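A toy sketch of that division of labor (the `Scheduler`/`Worker` names and command shapes here are purely illustrative, not the actual Dapr internals): quick queries are answered inline by the scheduler, while I/O of indeterminate length is handed off to the worker so the single-threaded scheduler actor stays responsive.

```typescript
type Command =
  | { kind: "query"; name: string }     // quick: handled by the scheduler itself
  | { kind: "invoke"; target: string }; // slow: dispatched to the worker

class Worker {
  // Long-running I/O (e.g. service invocation) happens here, off the scheduler.
  async invoke(target: string): Promise<string> {
    return `result-from-${target}`; // stand-in for a real service call
  }
}

class Scheduler {
  private pending: Promise<string>[] = [];
  constructor(private worker: Worker) {}

  handle(cmd: Command): string {
    if (cmd.kind === "query") {
      // Quick work is answered inline; the scheduler is never blocked.
      return `status-of-${cmd.name}`;
    }
    // Potentially slow work is queued to the worker; return immediately.
    this.pending.push(this.worker.invoke(cmd.target));
    return "scheduled";
  }

  async drain(): Promise<string[]> {
    return Promise.all(this.pending);
  }
}
```

The point of the sketch is only the responsiveness argument: the scheduler can keep answering queries while invocations are in flight on the worker.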

Instead, app containers will be responsible for hosting the actual workflow logic.

  • Can a single container image host multiple different types of workflow? Any theoretical limits you are aware of at this point?

  • Can a container image that hosts a workflow also invoke endpoints hosted in that same container image?

Can a single container image host multiple different types of workflow? Any theoretical limits you are aware of at this point?

Yes, absolutely. The code samples above show only one call to workflowClient.addWorkflow(...), but multiple calls can be made to register multiple workflows from the same app.

Can a container image that hosts a workflow also invoke endpoints hosted in that same container image?

Yes, a single container image/app can host workflows and service invocation endpoints together, so if the context.invoker.invoke call targets the currently running Dapr app, then the same container image would be the one that receives the service invocation request.

Thanks, this makes sense. So I suppose that there will be a pre-determined mapping that makes certain APIs, like context.invoker.invoke always go onto the worker actor, while other more constrained APIs will go to the scheduler.

Something else that stood out to me is that I don't see any reference to context.invoker.invoke in the API listing of the original post. Will that be part of a different proposal? I'm interested in understanding what other methods and utilities can be accessed from the DaprWorkflowContext object :-) .

The original post goes into the details of the workflow building block APIs, which describe how existing app code can interact with Dapr workflows, whether self-hosted or externally hosted. APIs for implementing self-hosted workflows like context.invoker.invoke aren't currently enumerated. Right now, we're expecting to cover core Dapr APIs, like service invocation, pub/sub, bindings, etc. but will likely have a few others as well. Exact details TBD.

The context object — is part of the Workflow SDK, right?

User code must use this context object from the SDK? There is no option to use a HTTP API?

If what I’ve said is correct, this wouldn’t align with the dapr principle of being language agnostic, right? Might that cause a rub?

User code must use this context object from the SDK? There is no option to use a HTTP API?

Correct, and @johnewart mentioned this in the original post in the "Workflow as code" section:

The Dapr SDK will internally communicate with the DTFx-go gRPC endpoint in the Dapr sidecar to receive new workflow events and send new workflow commands, but these protocol details will be hidden from the developer. Due to the complexities of the workflow protocol, we are not proposing any HTTP API for the runtime aspect of this feature.

The building block APIs will be exposed over HTTP; just not the workflow runtime APIs. Indeed, it's a deviation from how other Dapr building blocks work, including actors, but I think it's an appropriate tradeoff that allows us to build a fully-featured workflows runtime implementation. It would be very difficult for someone to correctly implement the workflow runtime APIs using an HTTP client.

Indeed, it's a deviation from how other Dapr building blocks work

Not much of a deviation. The Configuration API started out as gRPC only, and the upcoming Distributed Lock API is also gRPC only.

Sorry I should have said

User code must use this context object from the SDK? There is no option to use an HTTP API or gRPC API?

The API surfaces for the building blocks are simple, hence why you can bring your own HTTP or gRPC client (and avoid using SDKs), but the workflow API surface is complex (as Chris just mentioned), so using a language-specific WF SDK is unavoidable.

To me it’s fair to say that this is a deviation. Trade-off? Sure, but still a deviation from the status quo. Not trying to be negative btw, just highlighting the gap and how it lightly challenges my perception and expectations of dapr, as a user/consumer.

The API surfaces for the building blocks are simple, hence why you can bring your own HTTP or gRPC client (and avoid using SDKs), but the workflow API surface is complex (as Chris just mentioned), so using a language-specific WF SDK is unavoidable.

True, but you'll find it's the same with actors, that de-facto necessitate an Actor SDK to provide a simple programming model that is otherwise very complex when using the APIs directly.

Got it. I forget that dapr Actors require an SDK for the programming model. In that case, ignore everything I’ve said!

@olitomlinson -- I think those are great questions so thanks for asking them! I will avoid sounding like an echo chamber since Chris, Hal and Yaron mostly answered everyone's questions (they're so quick they answered before I even saw them!).

That being said, I can see a world where it might be possible to define workflows without using a language SDK, similar to how GitHub declares its workflows as YAML, with built-in predefined actions (or actions people have written similar to components). However, the explicit goal of this design is to avoid yet another workflow language ("yawful?" 😄) and allow developers to use their language of choice.

Got it. I forget that dapr Actors require an SDK for the programming model. In that case, ignore everything I’ve said!

No, you did bring up a valid point. While we aim to keep APIs accessible and usable over standard protocols directly (HTTP REST, gRPC) to increase adoption and be inclusive to all programming languages and frameworks, the first principle of Dapr to make developers successful does allow for more opinionated programming models like Actors and Workflows, provided there is a well reasoned, non-nightmarish way to extend them over HTTP and/or gRPC. So far this has worked very well for actors and I've reason to believe based on this proposal that it'll be the same for workflows.

Great proposal! I only have one minor question about the API: as you said, there's "A set of APIs for managing workflows (start, schedule, pause, resume, cancel)", but I cannot tell from the "API" section how a workflow can be paused and resumed.

I agree the wording is confusing, but I'm guessing the intent is to use the RaiseEvent API... some external process sends a custom 'PauseEvent' to the workflow, and then later a custom 'ResumeEvent'. The workflow just needs to understand and anticipate those events and react accordingly. Presumably the workflow programming model will expose the events from DaprWorkflowClient or similar (from the code examples above).

Assuming similarity to DTFx (mentioned above as inspiration for this proposal)... application semantics and communication protocol between workflow and outside world are left up to the workflow developer, subject to some constraints imposed by the runtime to provide durability and determinism guarantees.

@cgillum and @johnewart is that the basic idea?

@beiwei30 @jplane sorry for the confusion, this was an oversight. You are correct that pause/resume could be implemented in the workflow logic using external events, and that this is what people have done traditionally with DTFx and Durable Functions. However, our intent for Dapr Workflow is to make this a first class feature that’s built in since it tends to be a pretty useful feature. We’ll update the proposal description to include these APIs.
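The traditional event-based approach described above can be sketched as a small state machine. The event names (`PauseEvent`/`ResumeEvent`) and the `WorkflowInstance` shape are assumptions for illustration only; the first-class pause/resume APIs now planned for the proposal would make this hand-rolled pattern unnecessary.

```typescript
type Status = "Running" | "Paused" | "Completed";

class WorkflowInstance {
  status: Status = "Running";
  stepsDone = 0;

  // Stand-in for the RaiseEvent building block API: the workflow code
  // anticipates custom pause/resume events and reacts to them.
  raiseEvent(name: string): void {
    if (name === "PauseEvent" && this.status === "Running") this.status = "Paused";
    else if (name === "ResumeEvent" && this.status === "Paused") this.status = "Running";
  }

  // One scheduling turn: the workflow only makes progress while Running.
  runStep(): void {
    if (this.status !== "Running") return;
    this.stepsDone += 1;
    if (this.stepsDone >= 3) this.status = "Completed";
  }
}
```

An external agent "pauses" the instance by raising PauseEvent, and the workflow simply stops making progress until it sees ResumeEvent.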

Just curious, is there any reason that this proposal is not added to the roadmap overview https://github.com/orgs/dapr/projects/52?

Just curious, is there any reason that this proposal is not added to the roadmap overview https://github.com/orgs/dapr/projects/52?

Yes, it hasn't been accepted yet.

Given the parity to Durable Functions programming model, is it in scope to bring across the ContinueAsNew API and the Entities API too? Thanks

is it in scope to bring across the ContinueAsNew API and the Entities API too?

ContinueAsNew, yes - this will be important for application patterns like eternal workflows.

Durable entities is TBD. Dapr already has native support for actors, but I can imagine we might bring it in to support things like distributed critical sections in workflows at some point.
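For readers unfamiliar with ContinueAsNew, here is a rough model of the eternal-workflow pattern it enables: instead of one execution accumulating unbounded history, each iteration completes and restarts as a fresh execution carrying only its new input. All names here are illustrative, not the proposed API.

```typescript
interface RunResult {
  continueAsNew: boolean; // true -> restart with fresh history
  nextInput: number;      // the only state that survives the restart
}

// One orchestration execution: do a unit of work, decide whether to continue.
function orchestration(input: number): RunResult {
  const nextInput = input + 1;
  return { continueAsNew: nextInput < 3, nextInput };
}

// The runtime's view: each ContinueAsNew discards history and re-runs
// the orchestration with the new input, so history stays bounded.
function runEternal(start: number): { iterations: number } {
  let input = start;
  let iterations = 0;
  for (;;) {
    iterations += 1;
    const r = orchestration(input);
    if (!r.continueAsNew) break;
    input = r.nextInput; // history is discarded; only the input carries over
  }
  return { iterations };
}
```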

Hi folks, this looks like a great proposal. My name is Mauricio Salatino and I work for the Knative community, I believe that we are evaluating the same topic for Knative and I was wondering if someone interested in workflows in DAPR is going to KubeCon EU in Valencia, Spain, it will be great to catch up and see if we can align some of these initiatives. Feel free to drop me a message here or on Twitter @salaboy

Hi folks, this looks like a great proposal. My name is Mauricio Salatino and I work for the Knative community, I believe that we are evaluating the same topic for Knative and I was wondering if someone interested in workflows in DAPR is going to KubeCon EU in Valencia, Spain, it will be great to catch up and see if we can align some of these initiatives. Feel free to drop me a message here or on Twitter @salaboy

Most Dapr maintainers and STC members won't be attending KubeCon EU physically to the best of my knowledge. It would be best to schedule a virtual call to discuss this.

That sounds good to me. The week after KubeCon might be a good option for me at least, would that work?

That sounds good to me. The week after KubeCon might be a good option for me at least, would that work?

Sounds good. I'll reach out on Twitter after I circle back with the relevant people on the Dapr side.

Amazing to see this proposal! Big yes to it!!

Steering Committee Update

The Dapr STC voted in favor of accepting this proposal and the Workflows building block.

What's the timescales looking like for getting something in hand, even if its just an alpha build that we can play around with? Thanks

What's the timescales looking like for getting something in hand, even if its just an alpha build that we can play around with? Thanks

Since this proposal has been accepted by the STC, we will start the design work and aim to deliver a first simple, yet runnable, demo to quickly verify our ideas; that will give us a baseline for deeper, more detailed discussion.

For awareness, I've started working on a reference implementation for the embedded workflow engine. If you're interested in following along, it can be found here.

Note that the reference implementation is for demonstrating the feasibility of building a reliable Durable Task-based workflow engine backed by Dapr Actors. It's written in C#, uses the Dapr Actors SDK for .NET, and will run outside the Dapr sidecar. It's not intended to be used in the final implementation. However, it will support the same gRPC contract and can therefore be used to build/test Dapr Workflow SDK implementations in parallel with the actual embedded engine development.

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

bump so this doesn't go stale

Sorry for the radio silence. I will partially blame summer vacation schedules. :)

Just as an FYI to interested folks, some initial POC work has been completed and we're hoping to share progress and maybe do some demos at an upcoming Dapr Community Call.

For anyone who may have missed the community call, you can see a recording of the Dapr Workflow POCs using this link: https://www.youtube.com/watch?v=8Aj1WUzVvGs&t=115s.

@cgillum

In this proposal, will Workflows follow the same guidance as Azure Durable Functions in recommending that any long-running / compute-intensive operations are not done in the workflow?

@cgillum
Curious about the retry mechanics. In the demo you show multiple stages as part of the workflow (process order, approve, notify, etc...). If it succeeds in the first operation but fails in the next, will the workflow be tried as a whole again - all steps, including the ones that have succeeded?

@tstojecki

I'm not an authority on this, but there are a couple of ways this could go...

  1. It is up to your user code to try {} catch {} around any Operation that may fail. In the catch {} block, perform any necessary compensation around that Operation; an example compensation might be performing the same operation again, i.e. a manual retry.

  2. The developer/operator applies custom resiliency policies in Dapr to express the retry mechanism at the CRD/infrastructure level.

  3. A combination of both 1 and 2, i.e. apply resiliency policies first, and then fall back to exception handling if the resiliency policy is exhausted/unsuccessful.


If Dapr Workflows follows the tried and tested strategy of Azure Durable Functions, then any unhandled/uncaught Exception will put the Workflow into a failed state.

In this failed state you can either :

  1. Restart the workflow from the very beginning, essentially a fresh start. All operations/steps will be executed again regardless of previous attempts.
  2. Restart the workflow from the last known good state, aka rewind to just before the Operation failure, and attempt to try again.
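Option 1 above (user-code compensation with a bounded manual retry) might look roughly like the following; flakyOperation and the retry helper are stand-ins for illustration, not Dapr APIs.

```typescript
// Stand-in for an operation that fails transiently on the first two attempts.
async function flakyOperation(attempt: number): Promise<string> {
  if (attempt < 2) throw new Error("transient failure");
  return "ok";
}

async function runWithManualRetry(maxAttempts: number): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await flakyOperation(attempt);
    } catch (err) {
      // Compensation: here we simply retry; real code might also undo
      // side effects before trying again.
      if (attempt === maxAttempts - 1) throw err; // exhausted -> workflow would fail
    }
  }
  throw new Error("unreachable");
}
```

If every attempt is exhausted, the unhandled exception propagates and, per the Durable Functions model described above, would put the workflow into a failed state.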

In this proposal, will Workflows follow the same guidance as Azure Durable Functions in recommending that any long-running / compute-intensive operations are not done in the workflow?

Yes, for code-based workflows using the built-in engine, the same basic rules will apply: workflow code should be fully deterministic and should externalize compute intensive processing when possible.

@tstojecki regarding your question about the demo:

Curious about the retry mechanics. In the demo you show multiple stages as part of the workflow (process order, approve, notify, etc...). If it succeeds in the first operation but fails in the next, will the workflow be tried as a whole again - all steps, including the ones that have succeeded?

Just to add to what @olitomlinson said, there are two types of "failures" to consider:

  1. Application failures - for example, a service invocation returns an HTTP 500 error
  2. Infrastructure failures - for example, a node crashes and brings down your workflow process (or brings down the process of a service you were invoking).

In the first case (application failures), retry behavior will be governed by retry policies. Custom resiliency policies will definitely apply, and you'll also be able to implement custom error handling/retry logic directly in your workflow code using normal error handling constructs, like try/catch (more details on this to come).

In the second case (infrastructure failures), retries will be automatic. For example, if your workflow invokes a service and the node hosting that service crashes, the service invocation will be retried automatically. Similarly, if the node hosting the workflow crashes, the workflow will be restarted automatically and will resume from where it left off. Technically, the workflow will restart its execution from the beginning, but any operations (service invocation, pub/sub, etc.) that were already executed will be skipped and only new operations (the ones that weren't yet started) will be executed.
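The replay behavior described here can be modeled with a small sketch: a history map records the result of every completed operation, so a restarted execution re-runs the workflow function from the top but skips any operation whose result is already recorded. The `ReplayContext` shape is illustrative, not the actual engine.

```typescript
class ReplayContext {
  executions = 0; // counts real (non-replayed) side effects
  constructor(private history: Map<string, string>) {}

  callOperation(id: string, op: () => string): string {
    const saved = this.history.get(id);
    if (saved !== undefined) return saved; // already executed: skip, reuse result
    const result = op();                   // new operation: actually run it
    this.executions += 1;
    this.history.set(id, result);
    return result;
  }
}

// Deterministic workflow code: safe to re-execute from the beginning,
// because side effects only happen through callOperation.
function workflow(ctx: ReplayContext): string {
  const a = ctx.callOperation("processOrder", () => "order-123");
  return ctx.callOperation("notify", () => `notified:${a}`);
}
```

Running the workflow a second time against the same history (simulating a crash and restart) produces the same result with zero new side effects, which is exactly the "skip already-executed operations" behavior described above.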

Technically, the workflow will restart its execution from the beginning, but any operations (service invocation, pub/sub, etc.) that were already executed will be skipped and only new operations (the ones that weren't yet started) will be executed.

Interesting, thanks!

@olitomlinson it sounds like it won't rerun successfully executed operations

Restart the workflow from the very beginning, essentially a fresh start. All operations/steps will be executed again regardless of previous attempts.

Just an update for those who may be interested: the first iteration of the Durable Task Framework clone for Go is here: https://github.com/microsoft/durabletask-go.

This is just the core engine, the gRPC contract (for the runtime), and the core engine abstractions. I verified that it's compatible with the existing gRPC-based Durable Task SDK for .NET (the same one I used as the basis for my POC demo) by running all the existing Durable Task integration tests and pointing them at this engine. This is part 1 of delivering the embedded Dapr Workflow engine.

Part 2 will include the full Dapr integration. The code above doesn't include anything related to Dapr - it's pure Durable Task Framework stuff. In the next phase of the engine work, the plan is to import the above package as a dependency into the Dapr GitHub repo (starting in the feature/workflows branch of dapr/dapr) and implement the Actor-based backend using the Dapr Go SDK. Once this is done, we should be able to faithfully reproduce the previous demo but using the real Dapr sidecar without any POC bits or extra sidecars.

Sorry @tstojecki , I think we might be confusing two different things here.

I'm going to write here about what Azure Durable Functions does, and then @cgillum can fact-check if this is proposed to be the same for the embedded workflow engine in Dapr, or not.


If there is a transient infrastructure failure/node failure, this will not put the workflow into a terminal status of failed; the workflow runtime will implicitly attempt to bring the workflow back online and resume processing from where the failure occurred.

The alternative transient failure mode is when user-code throws an Exception, which is handled manually by the developer, and/or handled automatically by any resiliency policy. If the exception handling is exhausted, the expectation is that the workflow would transition into a terminal status of failed.

Given a failed state, at this point you as the developer will have the option of restarting the workflow from the beginning (all operations will be re-executed) or rewinding the workflow into its last known good state, and then resuming.

@cgillum As per our brief chat on Twitter, has there been a direction set for the isolation/scoping of registered workflows in a cluster?

Consider the following line where a workflow named ProcessOrder is registered

    options.RegisterWorkflow<OrderPayload, OrderResult>("ProcessOrder", implementation: async (context, input) => { ... });

I would consider the registration to have a scope that follows the convention of:

{namespace}.{dapr_app_id}.ProcessOrder

Such that two Apps with the same dapr App ID on different namespaces residing in the same cluster/dapr environment would not conflict/interfere with each other's operation in any way.

Example :

Namespace-A.App-Foo.ProcessOrder

Would not conflict with

Namespace-B.App-Foo.ProcessOrder


This would be in line with recent changes in 1.8 to support isolation of State Management and upcoming changes in 1.9 to support isolation of Pub/Sub.
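The proposed convention is simple enough to pin down as a tiny helper (purely illustrative of the naming scheme suggested above; not existing Dapr behavior):

```typescript
// Scope a workflow registration as {namespace}.{dapr_app_id}.{workflow_name},
// so identically named workflows in different namespaces never collide.
function scopedWorkflowId(namespace: string, appId: string, workflow: string): string {
  return `${namespace}.${appId}.${workflow}`;
}
```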

@olitomlinson the workflow engine depends on the state management organization of the underlying actors subsystem. Workflows will therefore inherit the isolation behavior of actors. We’re otherwise not making any explicit decisions about isolation for workflows. If actors are improved with better isolation guarantees, then workflows should be able to easily inherit the same improvements.

@cgillum An Actor is essentially isolated by its Actor ID. Is there any reason that the namespace and app Id can't be included in the Actor ID to get that isolation?

That's certainly an option, though I think it would be cleaner if the actor subsystem could take care of namespace prefixing for us. This is already being done for app IDs based on what I see generated in the state stores for actors, so there shouldn't be any conflict across different (uniquely named) apps.

In the current implementation (that's in active development) the actors simply use the instance IDs of the workflows, which are either user specified or randomly generated. You're right that we could theoretically prepend the namespace to all the actor IDs. However, I'm not yet familiar with namespaces in Dapr and would need to check to see whether we have access to the namespace identifier in all the needed places.

That's certainly an option, though I think it would be cleaner if the actor subsystem could take care of namespace prefixing for us.

It definitely would be cleaner, you're right there.

I only raise this because we can't use Dapr Actors directly in our product due to the inability to have namespace isolation of Actors. I would hate for Dapr Workflows to fall into the same trap, leaving my team once again unable to use another really valuable programming model.

JFYI, first PR into the feature/workflows branch has been merged: #5301.

It introduces an internal actor concept which is used by the new durable task-based workflow engine. More PRs will be published over the next few weeks that flesh out the full workflow engine feature set.

Linking dotnet-sdk proposal for dapr workflows:
#5314

Friends, if you're interested in the Serverless Workflow specification for this effort, please let us know! We are in the CNCF Slack, #serverless-workflow channel.

@msfussell

FYI: the search term "dapr workflows" on Google is probably going to return less favourable results when 1.10 ships; I suspect it would be desirable for it to return the new Workflow building block documentation rather than the sandbox project?

Friends, at present I have implemented a Go runtime based on the Serverless Workflow spec and have contributed it to the ASF (it will have a separate code repository under Apache). I would also like to get in touch with you if there is any intention to cooperate.

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.