Cluster Readiness of OE-Workflow

Current execution of OE-Workflow in cluster mode:

  • If the number of nodes in the cluster remains static or scales up, OE-Workflow should in theory work fine. However, new nodes will trigger the workflows that are in a pending state, in parallel with the already existing nodes. The nodes then race to complete the workflow task first, and the node that succeeds continues the workflow from there on. This behaviour will change once OE-Workflow becomes cluster ready.
  • If we scale down, the workflow processes running on the removed nodes will die and must somehow be continued on the remaining nodes.

Broad Overview of Workflow Instance Life Cycle

  • Once a workflow is triggered, the Workflow Instance is created with some initial state (including the ‘Start Event’ token), input process-variables, messages, etc. In the after-save hook, the start event token is emitted on the token arrived event.
  • After we trigger a workflow, a series of updates happens to the Workflow Instance. These updates happen as follows:
    • For the token arrived event, execute the implementation logic of that particular Workflow Activity.
    • Calculate the state changes that have to be applied to the Workflow Instance. This includes changes such as marking the token as complete and updating process-variables, messages, etc.
    • Then find the next flow objects that have to be executed after the current Workflow Activity. Create tokens for these, mark them as pending, and add them to the state change object as well.
    • Finally, apply the state changes to the Workflow Instance and try to commit the update. If the update fails because of parallel updates on the same Workflow Instance, fetch the latest Workflow Instance, apply the changes again, and retry the commit.
    • Once the Workflow Instance update succeeds, emit the new tokens on the token arrived event.
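
The update cycle above can be sketched as a version-checked (optimistic) commit with retry. This is only an illustration, not the actual OE-Workflow code: the `version` counter, the `InstanceStore` class (an in-memory stand-in for the database), and the names `StateChange`, `applyChange` and `commitWithRetry` are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real Workflow Instance model has more fields.
interface Token { id: string; name: string; status: "pending" | "complete"; }
interface WorkflowInstance {
  version: number; // assumed optimistic-lock counter
  tokens: Record<string, Token>;
  processVariables: Record<string, unknown>;
}
interface StateChange {
  completeTokenId: string;                    // token of the activity that just ran
  newTokens: Token[];                         // tokens for the next flow objects
  processVariables: Record<string, unknown>;  // process-variable updates
}

// In-memory stand-in for the database (assumption for this sketch).
class InstanceStore {
  private current: WorkflowInstance;
  constructor(initial: WorkflowInstance) { this.current = initial; }

  // Return a copy so callers cannot mutate the stored instance directly.
  fetch(): WorkflowInstance {
    return JSON.parse(JSON.stringify(this.current)) as WorkflowInstance;
  }

  // Conditional update: succeeds only if nobody wrote since expectedVersion.
  tryUpdate(expectedVersion: number, next: WorkflowInstance): boolean {
    if (this.current.version !== expectedVersion) return false;
    this.current = next;
    return true;
  }
}

// Build the next instance state without mutating the input.
function applyChange(instance: WorkflowInstance, change: StateChange): WorkflowInstance {
  const done = instance.tokens[change.completeTokenId];
  if (done === undefined) throw new Error("unknown token " + change.completeTokenId);
  const tokens: Record<string, Token> = { ...instance.tokens };
  tokens[done.id] = { ...done, status: "complete" };
  for (const t of change.newTokens) tokens[t.id] = t;
  return {
    version: instance.version + 1,
    tokens,
    processVariables: { ...instance.processVariables, ...change.processVariables },
  };
}

// Fetch the latest instance, re-apply the change, and retry on conflict.
function commitWithRetry(store: InstanceStore, change: StateChange, maxAttempts = 5): Token[] {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const latest = store.fetch();
    if (store.tryUpdate(latest.version, applyChange(latest, change))) {
      return change.newTokens; // the caller emits these on the token arrived event
    }
  }
  throw new Error("could not commit state change after " + maxAttempts + " attempts");
}

// Usage: complete the start token and create a pending token for the next activity.
const store = new InstanceStore({
  version: 0,
  tokens: { t1: { id: "t1", name: "Start Event", status: "pending" } },
  processVariables: {},
});
const emitted = commitWithRetry(store, {
  completeTokenId: "t1",
  newTokens: [{ id: "t2", name: "Script Task", status: "pending" }],
  processVariables: { approved: true },
});
```

Because each attempt re-applies the state change to a freshly fetched instance, a concurrent update by another node costs a retry but is not overwritten.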

Possible Approach

In the current implementation, if a workflow does not contain a User Task, all of its Workflow Activities execute on the same node.

However, this needs to change. The Workflow Activities of a particular workflow are independent entities (with access to shared memory in the form of process variables, which are stored in the database), and for our application to be truly distributed we need to be able to execute Workflow Activities of the same workflow on different nodes. This is where cluster events come into the picture. Currently, our token arrived event is a node.js event; instead, it needs to be a global cluster event so that any node can pick it up and start executing. The broadcaster also supports emitting an event that is listened to by all the nodes.

For Workflow specifically, we will need some kind of queue system (plus some kind of service discovery, if not all nodes serve the Workflow Service). The queue will take events and send each one to any node serving the Workflow Service. That node should serve the Workflow Activity request and send an acknowledgement to the global queue. Only when the queue receives the acknowledgement is the Workflow Activity considered complete and the event removed from the queue.

In this scenario, if nodes serving the Workflow Service go down, the queue has to detect it and either retry emitting the tokens on different nodes after a certain timeout or, say, poll the node serving the event at intervals to check that the Workflow Service nodes are still up.
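
The acknowledgement and retry behaviour described here could look roughly like the sketch below. It is a single-process, in-memory model for illustration only; a real deployment would use an actual message broker. `AckQueue`, `TokenEvent`, the method names and the 30-second timeout are assumptions, not part of OE-Workflow or the broadcaster. Time is passed in explicitly so the example stays deterministic.

```typescript
// Hypothetical event shape for a token that needs to be executed.
interface TokenEvent { id: string; workflowInstanceId: string; tokenName: string; }
interface InFlight { event: TokenEvent; nodeId: string; deadline: number; }

class AckQueue {
  private pending: TokenEvent[] = [];
  private inFlight: Record<string, InFlight | undefined> = {};
  private ackTimeoutMs: number;

  constructor(ackTimeoutMs: number) { this.ackTimeoutMs = ackTimeoutMs; }

  publish(event: TokenEvent): void { this.pending.push(event); }

  // Hand the next pending event to exactly one node and start its ack deadline.
  deliver(nodeId: string, now: number): TokenEvent | undefined {
    const event = this.pending.shift();
    if (event === undefined) return undefined;
    this.inFlight[event.id] = { event, nodeId, deadline: now + this.ackTimeoutMs };
    return event;
  }

  // The event leaves the queue only when the node that received it acknowledges it.
  ack(eventId: string, nodeId: string): boolean {
    const entry = this.inFlight[eventId];
    if (entry === undefined || entry.nodeId !== nodeId) return false;
    delete this.inFlight[eventId];
    return true;
  }

  // Move events whose ack deadline has passed back to pending for another node.
  requeueExpired(now: number): number {
    let requeued = 0;
    for (const id of Object.keys(this.inFlight)) {
      const entry = this.inFlight[id];
      if (entry !== undefined && now >= entry.deadline) {
        delete this.inFlight[id];
        this.pending.push(entry.event);
        requeued++;
      }
    }
    return requeued;
  }

  // Events still owned by the queue (waiting or awaiting acknowledgement).
  size(): number { return this.pending.length + Object.keys(this.inFlight).length; }
}

// Usage: node-A receives the token but goes down before acknowledging it.
const queue = new AckQueue(30000);
queue.publish({ id: "e1", workflowInstanceId: "wf-1", tokenName: "Script Task" });
const first = queue.deliver("node-A", 0);
const requeuedCount = queue.requeueExpired(30000); // timeout elapsed, no ack received
const second = queue.deliver("node-B", 30000);     // retried on a different node
const acked = second !== undefined && queue.ack(second.id, "node-B");
```

Note that a timeout-based retry gives at-least-once delivery: a slow node may still finish an activity after the event has been handed to another node, so the Workflow Instance update itself still has to tolerate duplicate attempts.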