Effect-TS / cluster




Investigate storing release order of activities in cluster-workflows

mattiamanzati opened this issue

What is the problem this feature would solve?

If the user uses combinators such as raceFirst in the workflow body, determinism may be broken by mis-timing: already executed activities may return in a different order than in the previous execution.
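For instance (a pseudo-async-await sketch in the style of the examples below; raceFirst and the activities are placeholders, not the actual API):

async function racingWorkflow() {
  // Whichever activity settles first wins the race. On replay, if the
  // two persisted results are released in a different order than during
  // the original execution, the other branch is taken: determinism breaks.
  const winner = await raceFirst(doA(), doB())
  if (winner === "A") {
    await doC()
  } else {
    await doD()
  }
}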

What is the feature you are proposing to solve the problem?

While it would be very easy to store the "return order" of the activities, and to ensure it is the same when re-spawning the workflow by using deferreds and replaying the exact release order, this raises an issue when workflows are upgraded and new activities are interleaved between already executed ones.

Let's say we have the following:

async function workflowV1(){
  await doA()
  await doB()
  await doC()
}

Only doA and doB are executed, then the server crashes.
Upon restart, the workflow is now updated and defined as follows (pseudo-async-await):

async function workflowV2(){
  await doA()
  await doZ()
  await doB()
  await doC()
}

Now the restarted workflow should:

  • see the request to execute doA
  • reply to doA with the previous return value
  • encounter the request for doZ, which was never executed
  • attempt to execute doZ
  • complete doZ
  • see the request to execute doB
  • reply to doB with the previous return value
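A minimal sketch of this replay rule (journal, lookupResult and recordResult are hypothetical helpers, not the actual cluster API):

async function executeActivity(journal, id, run) {
  // Already executed in a previous run: reply with the stored value.
  const prior = await journal.lookupResult(id)
  if (prior !== undefined) return prior.value
  // Never executed before (e.g. the newly interleaved doZ):
  // attempt it now and persist the result before continuing.
  const value = await run()
  await journal.recordResult(id, value)
  return value
}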

Or is it generally best to break, with the workflow never completing?
Because originally the user should have written something like:

async function workflowV2(){
  await doA()
  if(workflowVersion >= 2){
    await doZ()
  }
  await doB()
  await doC()
}

What alternatives have you considered?

No response

Further considerations:
defecting with a "DeterminismBrokenException" is best: it would still allow consuming the history of the workflow in a linear way. There may still be situations where we cannot detect broken determinism, but that is something the user should take care not to break.
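A sketch of that linear check (DeterminismBrokenException and the journal entry shape are assumptions, not the actual API):

class DeterminismBrokenException extends Error {}

function assertNextActivity(nextEntry, requestedId) {
  // Consume the history linearly: the activity being requested must
  // match the next recorded entry; otherwise defect loudly instead
  // of silently diverging.
  if (nextEntry !== undefined && nextEntry.activityId !== requestedId) {
    throw new DeterminismBrokenException(
      `expected ${nextEntry.activityId}, but ${requestedId} was requested`
    )
  }
}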

If we depend on the developer to identify potential nondeterminism, the developer must have the tools to control its behavior according to their preferences.
As an exploratory solution, it seems logical that tagging the results of activities (either through manual tagging or by combining the activity name with a hash of the parameters) could be helpful in this context.

(Least surprise and easy-to-reason-about often go hand in hand with simplicity and being explicit)
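As an illustration of the name-plus-parameter-hash variant (hash stands in for any stable hash function):

function activityTag(name, params) {
  // Combine the activity name with a hash of its serialized parameters,
  // so a stored result can only be matched to a request with the same
  // name and the same inputs.
  return `${name}-${hash(JSON.stringify(params))}`
}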


An activity is identified by an id, not by the parameters. Inputs are only needed at the workflow level


I was specifically referring to result retrieval at the workflow level during result replay, to detect mismatches

during replay the workflow engine will reuse the provided activity ids; see https://github.com/Effect-TS/cluster/blob/main/packages/cluster-pg/examples/simple-workflow.ts#L40 as an example.
The developer may blend parameters into the id of the activity, at their own risk
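For example (schematic; runActivity is a placeholder for the actual activity constructor):

// Blending a parameter into the activity id: a different userId now
// yields a different id on replay - at the developer's own risk.
await runActivity(`send-email-${userId}`, () => sendEmail(userId))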

Additionally: in the log we should persist both the order in which activities are requested and the order in which they complete, so we can detect determinism being broken in either the request order or the completion order
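Schematically, that means two kinds of ordered journal entries (names illustrative):

type JournalEntry =
  // appended when the workflow requests an activity
  | { _tag: "ActivityRequested"; activityId: string; seq: number }
  // appended when the activity's result is persisted
  | { _tag: "ActivityCompleted"; activityId: string; seq: number }

// On replay, both the request order and the completion order can be
// checked against the recorded seq, never against wall-clock timing.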

Ok got it; another idea:

  • persist the version as part of Activities.
  • have a getWorkflowVersion() which returns the version stored with the next Activity (or the actual workflow version if there is no other activity in the history).
    This would enable determinism.

We can still expose workflowVersion for people who know what they are doing.
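A sketch of the lookup described above (the journal shape is an assumption):

function getWorkflowVersion(journal, cursor, currentVersion) {
  // Return the version stored with the next Activity in the history;
  // once replay has consumed the whole history we are running fresh
  // code, so the actual workflow version applies.
  const next = journal.entries[cursor]
  return next !== undefined ? next.version : currentVersion
}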

On first run

async function workflowV2(){
  await doA()
  if(getWorkflowVersion() >= 2){ // returns 2
    await doZ()
  }
  await doB()
  await doC()
}

On second run if previous run was on version 1 and went to doA()

async function workflowV2(){
  await doA()
  if(getWorkflowVersion() >= 2){ // would return 2
    await doZ()
  }
  await doB()
  await doC()
}

On second run if previous run was on version 1 and went to doB()

async function workflowV2(){
  await doA()
  if(getWorkflowVersion() >= 2){ // would return 1
    await doZ()
  }
  await doB()
  await doC()
}

Uhm, I think this may be somewhat more convoluted.
We could start with an error such as:

While executing workflow [id] version [version], we expected the activity [activity-id] to be requested, but instead [new-activity-id] was requested. This is usually a sign that determinism has been broken somehow. Please update your workflow code to handle this new version.

But doing so would make this impossible, as it would fail if the previous execution went to doB:

async function workflowV2(){
  await doA()
  if(workflowVersion >= 2){
    await doZ() // inconsistency...
  }
  await doB()
  await doC()
}


If the previous execution went to doB, subsequent executions have to go to doB; that's the point.

async function workflowV2(){
  await doA()
  if(workflowVersion <= 2){
    await doB()
  }
  await doC()
}

This will make any version <=2 perform doB and subsequent versions will skip doB

Ok, that way it works indeed!

I thought this example was the target:

async function workflowV2(){
  await doA()
  if(workflowVersion >= 2){
    await doZ()
  }
  await doB()
  await doC()
}


The point is allowing the user to have enough information to correctly make new workflows act like old workflows on existing event traces; the strategy is very similar to what Temporal does.

cc @mattiamanzati we should probably provide an easy way for a user to store workflow version and retrieve it

Yeah indeed, and store that version in the Attempted journal entries, so that we get an audit of the workflow's initial version and eventual executions over new versions.
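Schematically (the field is an assumption, not the current journal schema):

type Attempted = {
  _tag: "Attempted"
  activityId: string
  // version of the workflow code that produced this attempt,
  // giving an audit trail across upgrades
  workflowVersion: number
}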

About workflow replay and non-determinism.

Concurrent behaviors might execute effects in a non-deterministic order.
If we're leaning towards enabling their usage, here's an evaluation strategy that might do the job:

A) The workflow (re)executes, but suspends effect execution on Activities (like Request does)
B) When no progress is possible, it applies* the next replay result from the history and continues the process at (A)
C) When there are no more replay results to apply, it executes the pending Activities and saves the results.

(*) If during (B) it cannot apply the result to a pending Activity, this means we've detected an inconsistency

I think that by doing so we could use concurrent / parallel variants (forEach, etc.) and still detect non-determinism.
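A rough sketch of that loop, with deferred-like handles for suspended Activities (all names hypothetical; re-collecting newly suspended Activities after each step is omitted for brevity):

async function replayStep(history, pending) {
  // (A) the workflow has re-executed up to its suspended Activities;
  // `pending` maps activity ids to their suspended handles.
  for (const event of history) {
    // (B) apply the next replay result so the workflow can progress
    const handle = pending.get(event.activityId)
    if (handle === undefined) {
      // (*) the result matches no pending Activity: inconsistency detected
      throw new Error(`no pending activity for ${event.activityId}`)
    }
    handle.resolve(event.result)
    pending.delete(event.activityId)
  }
  // (C) no more replay results: actually execute what is still pending
  for (const [id, handle] of pending) {
    handle.resolve(await executeAndPersist(id))
  }
}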

N.B.: there's also the possibility of having Workflow-specific constructs (and an interpreter), which may provide more direct control and enable higher-level workflow constructs

Can you make a diagram that shows the proposed method? As far as I understand, the proposed method doesn't work, as we can't just continue without processing an activity.

Regardless, using parallel operations should already be fine; forks should be considered like child workflows

This was the idea:

sequenceDiagram
    Workflow->>Activity1: Suspended
    Workflow->>Activity2: Suspended
    Workflow->>Activity3: Suspended
    Workflow->>Workflow: No more progress possible read event
    Workflow->>Activity2: Fulfill with Result2 with event
    Workflow->>Workflow: No more progress possible read event
    Workflow->>Activity3: Fulfill with Result3 with event
    Workflow->>Workflow: No more progress possible no more event available
    Workflow->>Activity1: Execute the Suspended Activity 
    Workflow->>Activity4: etc...

Regardless, using parallel operations should already be fine; forks should be considered like child workflows

That would mean we must implement and use a special interpreter for workflows, to create subworkflows when forking.
Also, this does not, per se, resolve the out-of-order activity completion problem

Either way, these solutions require modifying the way we interpret effects


A less intrusive solution (not requiring interpreter changes) would be to introduce workflow-specific combinators to fork subworkflows, iterate over collections with subworkflows, race subworkflows, etc.
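Signature-wise that could look something like this (purely illustrative, not a proposed API):

interface Workflow<A> { readonly _A?: A } // placeholder for a workflow effect
interface Handle<A> { readonly _A?: A }   // placeholder for a subworkflow handle

// fork a child workflow and obtain a handle to await later
declare const forkWorkflow: <A>(id: string, body: Workflow<A>) => Workflow<Handle<A>>
// iterate a collection, running each element as its own subworkflow
declare const forEachWorkflow: <A, B>(id: string, items: A[], f: (a: A) => Workflow<B>) => Workflow<B[]>
// race subworkflows; the first to complete wins
declare const raceWorkflows: <A>(id: string, contenders: Workflow<A>[]) => Workflow<A>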

Note: I think that, conceptually*, workflow behaviour should be indistinguishable from how it would behave if workflows were executed as sync code:

  • re-executing from the whole journal each time an activity / subworkflow ends.
  • hitting workflow combinators - and activities - means giving execution control flow back to the workflow executor

(*) conceptually, because it might be inefficient compared to optimistically keeping it in memory
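As a conceptual driver (deliberately naive; replayFromStart and runNext are placeholders):

async function drive(journal, workflow) {
  // Each time an activity / subworkflow ends, discard in-memory state
  // and re-execute the body from the start against the whole journal.
  while (true) {
    const outcome = await replayFromStart(journal, workflow)
    if (outcome.done) return outcome.value
    // control flow came back at a combinator / activity:
    // run it, append the result to the journal, replay again
    journal.append(await outcome.runNext())
  }
}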


Not sure which "interpreter" you are referring to: a workflow is just an Effect, there is no interpreter. The Effect's interpreter is the Fiber, and it is final, meaning that combinators are implemented on top of the fiber structure, so you can't swap the interpreter.

As far as I understand, there is no way of detecting "no more progress possible"

I thought you had to change the interpreter to support RequestResolvers, but maybe that's not the case (not checked yet)

So I was referring to a RequestResolver-like ability to address out-of-order execution:

  • suspend execution of Activities
  • completeEffect on already executed Activities (from the Workflow EventLog - with random access here to retrieve the matching Event)
  • execute the non-completed Activities


Not sure what you're referring to with "interpreter"; the addition of request resolvers was handled directly in the fiber. There is no secondary interpreter: basically a new primitive, and forEach handles the batching.

Ok, finally had a shot at this.
The runtime now stores the order in which activities start and end, which allows the replay phase to not rely on timings at all, only on the sequence of events.
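Conceptually, the journal becomes a single ordered sequence of start/end events, and replay releases results strictly in that recorded order (illustrative shape):

// e.g. [Started(doA), Started(doB), Ended(doB), Ended(doA), ...]
// A raceFirst between doA and doB therefore resolves the same way on
// every re-execution, regardless of real completion timings.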