Effect-TS / cluster




Investigate storing release order of activities in cluster-workflows

mattiamanzati opened this issue

What is the problem this feature would solve?

If the user uses combinators such as raceFirst in the workflow body, determinism may be broken by mis-timing: already executed activities may return in a different order than in the previous execution.
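For instance (a pseudo-async-await sketch in the style of the examples below; raceFirst and the activities are placeholders, not the actual API):

async function racingWorkflow() {
  // Whichever activity settles first wins the race. On replay, if the
  // two persisted results are released in a different order than during
  // the original execution, the other branch is taken: determinism breaks.
  const winner = await raceFirst(doA(), doB())
  if (winner === "A") {
    await doC()
  } else {
    await doD()
  }
}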

What is the feature you are proposing to solve the problem?

While it would be very easy to store the "return order" of the activities, and to ensure it is the same when re-spawning the workflow by using deferreds and replaying the exact release order, this raises an issue when workflows are upgraded and new activities are interleaved between already executed ones.

Let's say we have the following:

async function workflowV1(){
  await doA()
  await doB()
  await doC()
}

Only doA and doB are executed, then the server crashes.
Upon restart, the workflow is now updated and defined as follows (pseudo-async-await):

async function workflowV2(){
  await doA()
  await doZ()
  await doB()
  await doC()
}

Now the restarted workflow should:

  • see the request to execute doA
  • reply to doA with the previous return value
  • encounter the request for doZ, which was never executed
  • attempt to execute doZ
  • complete doZ
  • see the request to execute doB
  • reply to doB with the previous return value
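A minimal sketch of this replay rule (journal, lookupResult and recordResult are hypothetical helpers, not the actual cluster API):

async function executeActivity(journal, id, run) {
  // Already executed in a previous run: reply with the stored value.
  const prior = await journal.lookupResult(id)
  if (prior !== undefined) return prior.value
  // Never executed before (e.g. the newly interleaved doZ):
  // attempt it now and persist the result before continuing.
  const value = await run()
  await journal.recordResult(id, value)
  return value
}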

Or is it generally best to break, with the workflow never completing?
Because originally the user should have written something like:

async function workflowV2(){
  await doA()
  if(workflowVersion >= 2){
    await doZ()
  }
  await doB()
  await doC()
}

What alternatives have you considered?

No response

Further considerations:
defecting with a "DeterminismBrokenException" is best: it would still allow consuming the history of the workflow in a linear way. There may still be situations where we cannot detect broken determinism, but that is something the user should take care not to break.
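A sketch of that linear check (DeterminismBrokenException and the journal entry shape are assumptions, not the actual API):

class DeterminismBrokenException extends Error {}

function assertNextActivity(nextEntry, requestedId) {
  // Consume the history linearly: the activity being requested must
  // match the next recorded entry; otherwise defect loudly instead
  // of silently diverging.
  if (nextEntry !== undefined && nextEntry.activityId !== requestedId) {
    throw new DeterminismBrokenException(
      `expected ${nextEntry.activityId}, but ${requestedId} was requested`
    )
  }
}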

If we depend on the developer to identify potential nondeterminism, the developer must have the tools to control its behavior according to their preferences.
As an exploratory solution, it seems logical that tagging the results of activities (either through manual tagging or by combining the activity name with a hash of the parameters) could be helpful in this context.

(Least surprise and easy-to-reason-about often go hand in hand with simplicity and being explicit)
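As an illustration of the name-plus-parameter-hash variant (hash stands in for any stable hash function):

function activityTag(name, params) {
  // Combine the activity name with a hash of its serialized parameters,
  // so a stored result can only be matched to a request with the same
  // name and the same inputs.
  return `${name}-${hash(JSON.stringify(params))}`
}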


An activity is identified by an id, not by the parameters. Inputs are only needed at the workflow level


I was specifically referring to result retrieval at the workflow level during result replay, to detect mismatches

during replay the workflow engine will reuse the provided activity ids; see https://github.com/Effect-TS/cluster/blob/main/packages/cluster-pg/examples/simple-workflow.ts#L40 as an example.
The developer may blend parameters into the id of the activity, at their own risk
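For example (schematic; runActivity is a placeholder for the actual activity constructor):

// Blending a parameter into the activity id: a different userId now
// yields a different id on replay - at the developer's own risk.
await runActivity(`send-email-${userId}`, () => sendEmail(userId))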

Additionally: in the log we should persist both the order in which activities are requested and the order in which they complete, so we can detect determinism being broken in either the request order or the completion order
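Schematically, that means two kinds of ordered journal entries (names illustrative):

type JournalEntry =
  // appended when the workflow requests an activity
  | { _tag: "ActivityRequested"; activityId: string; seq: number }
  // appended when the activity's result is persisted
  | { _tag: "ActivityCompleted"; activityId: string; seq: number }

// On replay, both the request order and the completion order can be
// checked against the recorded seq, never against wall-clock timing.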

Ok got it; another idea:

  • persist the version as part of Activities.
  • have a getWorkflowVersion() which returns the version stored with the next Activity (or the actual workflow version if there is no other activity in the history).
    This would enable determinism.

We can still expose workflowVersion for people who know what they are doing.
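A sketch of the lookup described above (the journal shape is an assumption):

function getWorkflowVersion(journal, cursor, currentVersion) {
  // Return the version stored with the next Activity in the history;
  // once replay has consumed the whole history we are running fresh
  // code, so the actual workflow version applies.
  const next = journal.entries[cursor]
  return next !== undefined ? next.version : currentVersion
}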

On first run

async function workflowV2(){
  await doA()
  if(getWorkflowVersion() >= 2){ // returns 2
    await doZ()
  }
  await doB()
  await doC()
}

On second run if previous run was on version 1 and went to doA()

async function workflowV2(){
  await doA()
  if(getWorkflowVersion() >= 2){ // would return 2
    await doZ()
  }
  await doB()
  await doC()
}

On second run if previous run was on version 1 and went to doB()

async function workflowV2(){
  await doA()
  if(getWorkflowVersion() >= 2){ // would return 1
    await doZ()
  }
  await doB()
  await doC()
}

Uhm, I think this may be somewhat more convoluted.
We could start with an error such as:

While executing workflow [id] version [version], we expected the activity [activity-id] to be requested, but instead [new-activity-id] was requested. This is usually a sign that determinism has been broken somehow. Please update your workflow code to handle this new version.

But doing so would make this impossible, as it would fail if the previous execution went to doB:

async function workflowV2(){
  await doA()
  if(workflowVersion >= 2){
    await doZ() // inconsistency...
  }
  await doB()
  await doC()
}


If the previous execution went to doB, subsequent executions have to go to doB; that's the point.

async function workflowV2(){
  await doA()
  if(workflowVersion <= 2){
    await doB()
  }
  await doC()
}

This will make any version <=2 perform doB and subsequent versions will skip doB

Ok, that way it works indeed!

I thought this example was the target:

async function workflowV2(){
  await doA()
  if(workflowVersion >= 2){
    await doZ()
  }
  await doB()
  await doC()
}


The point is allowing the user to have enough information to correctly make new workflows act like old workflows on existing event traces; the strategy is very similar to what Temporal does.

cc @mattiamanzati we should probably provide an easy way for a user to store workflow version and retrieve it

Yeah indeed, and store that version in the Attempted journal entries, so that we get an audit of the workflow's initial version and eventual executions over new versions.
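Schematically (the field is an assumption, not the current journal schema):

type Attempted = {
  _tag: "Attempted"
  activityId: string
  // version of the workflow code that produced this attempt,
  // giving an audit trail across upgrades
  workflowVersion: number
}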

About workflow replay and non-determinism.

Concurrent behaviors might execute effects in a non-deterministic order.
If we're leaning towards enabling their usage, here's an evaluation strategy that might do the job:

A) The workflow (re)executes, but suspends effect execution on Activities (like Request does)
B) When no progress is possible, it applies* the next replay result from the history and continues the process at (A)
C) When there are no more replay results to apply, it executes the pending Activities and saves the results.

(*) If during (B) it cannot apply the result to a pending Activity, this means we've detected an inconsistency

I think that by doing so we could use concurrent / parallel variants (forEach, etc.) and still detect non-determinism.
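A rough sketch of that loop, with deferred-like handles for suspended Activities (all names hypothetical; re-collecting newly suspended Activities after each step is omitted for brevity):

async function replayStep(history, pending) {
  // (A) the workflow has re-executed up to its suspended Activities;
  // `pending` maps activity ids to their suspended handles.
  for (const event of history) {
    // (B) apply the next replay result so the workflow can progress
    const handle = pending.get(event.activityId)
    if (handle === undefined) {
      // (*) the result matches no pending Activity: inconsistency detected
      throw new Error(`no pending activity for ${event.activityId}`)
    }
    handle.resolve(event.result)
    pending.delete(event.activityId)
  }
  // (C) no more replay results: actually execute what is still pending
  for (const [id, handle] of pending) {
    handle.resolve(await executeAndPersist(id))
  }
}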

N.B.: there's also the possibility of having Workflow-specific constructs (and an interpreter), which may provide more direct control and enable higher-level workflow constructs

Can you make a diagram that shows the proposed method? As far as I understand, the proposed method doesn't work, as we can't just continue without processing an activity.

Regardless, using parallel operations should already be fine; forks should be considered like child workflows

This was the idea:

sequenceDiagram
    Workflow->>Activity1: Suspended
    Workflow->>Activity2: Suspended
    Workflow->>Activity3: Suspended
    Workflow->>Workflow: No more progress possible read event
    Workflow->>Activity2: Fulfill with Result2 with event
    Workflow->>Workflow: No more progress possible read event
    Workflow->>Activity3: Fulfill with Result3 with event
    Workflow->>Workflow: No more progress possible no more event available
    Workflow->>Activity1: Execute the Suspended Activity 
    Workflow->>Activity4: etc...

Regardless, using parallel operations should already be fine; forks should be considered like child workflows

That would mean we must implement and use a special interpreter for workflows, to create subworkflows when forking.
Also, this does not, per se, resolve the out-of-order activity completion problem

Either way, these solutions require modifying the way we interpret effects


A less intrusive solution (not requiring interpreter changes) would be to introduce workflow-specific combinators to fork subworkflows, iterate over collections with subworkflows, race subworkflows, etc.
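Signature-wise that could look something like this (purely illustrative, not a proposed API):

interface Workflow<A> { readonly _A?: A } // placeholder for a workflow effect
interface Handle<A> { readonly _A?: A }   // placeholder for a subworkflow handle

// fork a child workflow and obtain a handle to await later
declare const forkWorkflow: <A>(id: string, body: Workflow<A>) => Workflow<Handle<A>>
// iterate a collection, running each element as its own subworkflow
declare const forEachWorkflow: <A, B>(id: string, items: A[], f: (a: A) => Workflow<B>) => Workflow<B[]>
// race subworkflows; the first to complete wins
declare const raceWorkflows: <A>(id: string, contenders: Workflow<A>[]) => Workflow<A>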

Note: I think that, conceptually*, workflow behaviour should be indistinguishable from how it would behave if workflows were executed as sync code:

  • re-executing from the whole journal each time an activity / subworkflow ends.
  • hitting workflow combinators - and activities - means giving execution control flow back to the workflow executor

(*) conceptually, because it might be inefficient compared to optimistically keeping it in memory
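As a conceptual driver (deliberately naive; replayFromStart and runNext are placeholders):

async function drive(journal, workflow) {
  // Each time an activity / subworkflow ends, discard in-memory state
  // and re-execute the body from the start against the whole journal.
  while (true) {
    const outcome = await replayFromStart(journal, workflow)
    if (outcome.done) return outcome.value
    // control flow came back at a combinator / activity:
    // run it, append the result to the journal, replay again
    journal.append(await outcome.runNext())
  }
}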


Not sure which "interpreter" you are referring to: a workflow is just an Effect, there is no interpreter. The Effect's interpreter is the Fiber, and it is final, meaning that combinators are implemented on top of the fiber structure, so you can't swap the interpreter.

As far as I understand, there is no way of detecting "no more progress possible"

I thought you had to change the interpreter to support RequestResolvers, but maybe that's not the case (not checked yet)

So I was referring to a RequestResolver-like ability to address out-of-order execution:

  • suspend execution of Activities
  • completeEffect on already executed Activities (from the Workflow EventLog - with random access here to retrieve the matching Event)
  • execute the non-completed Activities


Not sure what you're referring to with "interpreter"; the addition of request resolvers was handled directly in the fiber. There is no secondary interpreter: basically a new primitive, and forEach handles the batching.

Ok, finally had a shot at this.
The runtime now stores the order in which activities start and end, which allows the replay phase to not rely on timings at all, only on the sequence of events.
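Conceptually, the journal becomes a single ordered sequence of start/end events, and replay releases results strictly in that recorded order (illustrative shape):

// e.g. [Started(doA), Started(doB), Ended(doB), Ended(doA), ...]
// A raceFirst between doA and doB therefore resolves the same way on
// every re-execution, regardless of real completion timings.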