Actyx / machines


[Brain Dump] Fate Sealing

Kelerchian opened this issue

Time travel can change the outcome of a machine-runner-based workflow. Such a change may be unintended, even harmful, for example when the swarm has already committed a crucial action based on the final state.

Fate Sealing is a mechanism for sealing a workflow so that future time travel cannot alter its history.

API

// file: "manager.ts"
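// NOTE: `machineProtocol` (the machine definition) and `machine` (a running
// MachineRunner instance) are assumed to be defined elsewhere;
// `makeFateSealable()` and `sealFate()` are the proposed additions.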

// state definition
export const Done = machineProtocol
  .designEmpty("Done")
  .makeFateSealable()
  .finish()

export const Failed = machineProtocol
  .designEmpty("Failed")
  .makeFateSealable()
  .finish()

// machine usage
for await (const state of machine) {
  // other state handling here
  
  const whenFailed = state.as(Failed);
  if (whenFailed) {
    await whenFailed.sealFate();
    break;
  }
  
  const whenDone = state.as(Done);
  if (whenDone) {
    await whenDone.sealFate();
    await doSomethingImportantUndoable();
    break;
  }
}

Under the hood

A special event is published.

Its tags:

  • "[workflow_name]" (machine-runner's standard tag)
  • "[workflow_name]:[id]" (machine-runner's standard tag)
  • "machine-runner-fate-seal:[workflow_name]" (new tag marking fate-sealing)

Its payload:

{
  type: "actyx/machine-runner/fate-seal", // constant; `type` must exist here so as not to break current consumers' AQL
  offsetMap: OffsetMap, // the last offset map recorded by the machine doing the sealing
  history: [EventId, EventId, ...] // IDs of the events not discarded by the machine doing the sealing
}
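
For illustration, here is a minimal sketch of publishing such an event, assuming the Actyx SDK's Tag and publish APIs; publishFateSeal and its parameters are hypothetical helpers, not existing machine-runner code.

// Hedged sketch: publishing a fate-seal event via the Actyx SDK (assumed API).
import { Actyx, OffsetMap, Tag } from "@actyx/sdk";

type FateSealPayload = {
  type: "actyx/machine-runner/fate-seal";
  offsetMap: OffsetMap;
  history: string[]; // event IDs
};

const publishFateSeal = async (
  actyx: Actyx,
  workflowName: string,
  id: string,
  lastOffsetMap: OffsetMap,
  nonDiscardedEventIds: string[],
): Promise<void> => {
  const payload: FateSealPayload = {
    type: "actyx/machine-runner/fate-seal",
    offsetMap: lastOffsetMap,
    history: nonDiscardedEventIds,
  };
  const tags = Tag<FateSealPayload>(workflowName)
    .and(Tag<FateSealPayload>(`${workflowName}:${id}`))
    .and(Tag<FateSealPayload>(`machine-runner-fate-seal:${workflowName}`));
  await actyx.publish(tags.apply(payload));
};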

How MachineRunner Internals Read Fate-Seal Events

  1. When a fate-seal event is detected, it is parsed safely (non-throwing).
  2. If parsing fails, the fate-seal event is ignored.
  3. If parsing succeeds (see the sketch below):
    1. machine-runner-internal is reset (back to the initial state)
    2. a query is issued to fetch the events listed in the history
    3. those events are pushed into machine-runner-internal to replay past transitions ("caught" is not emitted during this period)
    4. a new subscription is issued starting from the fate-seal event's offsetMap and piped normally into machine-runner-internal (which triggers "caught" as per the old mechanism)
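
A hedged sketch of this flow, reusing the FateSealPayload type from the sketch above; internals, queryEventsById, and subscribeFrom are hypothetical stand-ins for machine-runner internals, not existing API.

// Hedged sketch of fate-seal handling; all names below are hypothetical.
import { OffsetMap } from "@actyx/sdk";

type ActyxEventLike = { meta: { eventId: string }; payload: unknown };

interface MachineInternals {
  resetToInitial(): void; // back to the initial state
  pushEvent(event: ActyxEventLike): void; // feed one event into the state machine
  suppressCaught(run: () => void): void; // run without emitting "caught"
}

declare const internals: MachineInternals;
declare const queryEventsById: (ids: string[]) => Promise<ActyxEventLike[]>;
declare const subscribeFrom: (
  offsetMap: OffsetMap,
  onEvent: (event: ActyxEventLike) => void,
) => void;

// Step 1: safe, non-throwing parse of a candidate fate-seal payload.
const parseFateSeal = (payload: unknown): FateSealPayload | null => {
  if (typeof payload !== "object" || payload === null) return null;
  const p = payload as Partial<FateSealPayload>;
  if (p.type !== "actyx/machine-runner/fate-seal") return null;
  if (!Array.isArray(p.history) || typeof p.offsetMap !== "object") return null;
  return p as FateSealPayload;
};

const onFateSealEvent = async (raw: unknown): Promise<void> => {
  const seal = parseFateSeal(raw);
  if (seal === null) return; // step 2: parse failed, ignore the event

  internals.resetToInitial(); // step 3.1

  const pastEvents = await queryEventsById(seal.history); // step 3.2

  // step 3.3: replay past transitions without emitting "caught"
  internals.suppressCaught(() => {
    for (const event of pastEvents) internals.pushEvent(event);
  });

  // step 3.4: resume a live subscription from the sealed offsetMap
  subscribeFrom(seal.offsetMap, (event) => internals.pushEvent(event));
};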

Prevention of Fate Seal Duplication and Accidental Infinite Loop

Once a fate-seal event has been processed, it must be ignored on subsequent encounters to prevent an accidental infinite loop. Another thing to prevent is duplication caused by sealFate() being called multiple times.

To prevent this, machine-runner must remember 1.) the last event ID queued into the state machine and 2.) the list of sealed event IDs. If the last event ID queued into the state machine is already in the list of sealed event IDs, sealFate() is ignored and the promise it returns resolves immediately, as in the sketch below.
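
A minimal sketch of that guard, assuming hypothetical internal names:

// Hedged sketch of the duplication guard; names are hypothetical.
class FateSealGuard {
  private lastQueuedEventId: string | null = null;
  private readonly sealedEventIds = new Set<string>();

  // called whenever an event is queued into the state machine
  onEventQueued(eventId: string): void {
    this.lastQueuedEventId = eventId;
  }

  // called whenever a fate-seal event is processed
  onSealProcessed(history: string[]): void {
    for (const id of history) this.sealedEventIds.add(id);
  }

  // true if sealFate() should be a no-op whose promise resolves immediately
  shouldSkipSealFate(): boolean {
    return (
      this.lastQueuedEventId !== null &&
      this.sealedEventIds.has(this.lastQueuedEventId)
    );
  }
}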

Questions

When replicating a fate-seal event, is there a chance that this particular Actyx node does not have all events recorded in the history?

Update 2023-09-06:

Answers:

When replicating a fate-seal event, is there a chance that this particular Actyx node does not have all events recorded in the history?

No. Actyx's event dissemination is epidemic: every Actyx node that owns a fate-seal event will also have all events whose IDs are recorded in its history.

If it were this simple then we’d have created something that the CAP theorem says cannot exist ;-)

The main problem is concurrent sealing: node A has some events and seals some history, while node B has somewhat different events and seals its own history. Now we have conflicting notions of what has been sealed, so eventual consensus means that one of the nodes will see its seal broken by the other once all events are merged.


The only way that such a feature could work is by employing a consensus algorithm of some kind. We could use an existing one, e.g. to update a "minimum Lamport timestamp" which would then prevent any node from publishing new events earlier than this timestamp. At that point, we could use that minimum timestamp to see which parts of the history are set in stone and can no longer change.

Instead of using Paxos or Raft we could add a swarm joining procedure (currently new nodes just start publishing whatever they want; we’d have to restrict new nodes to respect the current minimum Lamport timestamp). Then we could track the minimum Lamport clock seen on any protocol participant to notice when a decision is final.
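
As a rough illustration of that idea (types and names below are hypothetical, not an existing API):

// Hedged sketch: an event is "set in stone" once every known protocol
// participant has advanced past its Lamport timestamp.
type Participant = { nodeId: string; lamport: number };

// The swarm-wide minimum Lamport timestamp seen so far.
const minimumLamport = (participants: Participant[]): number =>
  participants.reduce((min, p) => Math.min(min, p.lamport), Infinity);

// Final iff the event's Lamport timestamp is at or below the minimum
// observed across all protocol participants.
const isFinal = (eventLamport: number, participants: Participant[]): boolean =>
  participants.length > 0 && eventLamport <= minimumLamport(participants);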

No matter how we do it: under a suitable network partition this will not advance, so it will block progress of any application that needs it. This is a fundamental property of distributed systems that we cannot get around.

so eventual consensus means that one of the nodes will see its seal broken by the other once all events are merged

Yes. Therefore, this must not be offered as fully guaranteed sealing ("fate sealing" might be a bad name after all), but rather as a higher-level and more precise control over who can win (i.e. at least if A seals and B does not, A wins), before falling back to the more granular and chaotic consistency of event merging.

I still tend to avoid incorporating consensus algorithms directly inside machine-runner. Instead, I see them as something that should be combined with it via composition (in the OOP sense, as opposed to inheritance), letting the user decide which mechanism trumps which.

I agree on the composition of different mechanisms (one available, one consistent). Between those, my gut feeling is that there is no room in our mental complexity budget for the "soft sealing" mechanism you describe above; people would probably not be able to use it correctly.

That is probably right.
I will shelve this since there is no immediate need for it.

As for hard sealing, it is better done outside of machine-runner and composed with it, rather than being a part of it.