Workflow Recovery Mechanism

Business processes and workflows are by nature long-running. When the workflow engine is restarted, some flow instances are likely to have stopped before completing their entire flow. The oeWorkflow engine recovers and resumes these pending process instances during startup.

A boot script picks up all pending processes and checks the state of their tokens that are also in a pending (or running) state. Depending on how each node is implemented, these tokens are retriggered appropriately and the process continues.

The recovery process cannot identify the exact point of failure; the flow node is the atomic unit of safe checkpointing, so an interrupted node is re-executed as a whole. Normally, this is only problematic for remote service calls. The workflow engine maintains and carries a correlationId property that can be used to make safe, idempotent calls to a remote service, provided the service supports idempotent behavior.
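
As an illustration of why this matters, the following is a minimal sketch of a remote service that deduplicates requests by correlation id. The class `PaymentService`, its `charge` method, and the id format `"proc-42:token-7"` are hypothetical and are not part of oeWorkflow; the sketch only assumes that the engine sends the same correlationId when it re-executes a node.

```python
class PaymentService:
    """Hypothetical remote service that deduplicates by correlation id."""

    def __init__(self):
        self._results = {}    # correlation id -> result of the first execution
        self.executions = 0   # how many times the side effect actually ran

    def charge(self, correlation_id, amount):
        # A repeated call with a known correlation id returns the stored
        # result instead of performing the side effect a second time.
        if correlation_id in self._results:
            return self._results[correlation_id]
        self.executions += 1
        result = {"status": "charged", "amount": amount}
        self._results[correlation_id] = result
        return result


service = PaymentService()
# First attempt, made before the engine stopped (hypothetical id format).
first = service.charge("proc-42:token-7", 100)
# Recovery re-executes the node and repeats the call with the same id.
retry = service.charge("proc-42:token-7", 100)
```

With a service of this kind, re-executing the node during recovery yields the same result as the first attempt, and the charge is applied only once.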

In a clustered environment, exactly one instance performs the recovery. A container instance acquires a recovery lock (through master-job-executor) before it runs recovery.
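
The effect of the recovery lock can be sketched as follows. This is an in-memory stand-in, not the master-job-executor implementation: the class `RecoveryLock`, its `try_acquire` method, and the container names are assumptions made for illustration, and a real cluster would need a lock held in shared storage.

```python
import threading


class RecoveryLock:
    """Hypothetical single-owner lock; stands in for a cluster-wide lock."""

    def __init__(self):
        self._guard = threading.Lock()  # makes check-and-set atomic
        self._owner = None

    def try_acquire(self, container_id):
        # Only the first caller becomes the owner; later callers are refused.
        with self._guard:
            if self._owner is None:
                self._owner = container_id
                return True
            return False


lock = RecoveryLock()
containers = ("container-a", "container-b", "container-c")
# Each container tries to take the lock at startup; only the winner
# goes on to run recovery.
ran_recovery = [c for c in containers if lock.try_acquire(c)]
```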

However, when recovery sees a running process or token, it cannot be sure whether the token is still executing on another container or was left as a zombie by a stopped container. Fortunately, all BPMN nodes (except User Tasks and long-running timers) complete their execution quickly. The recovery process therefore observes each running token (other than User Task and Timer tokens) for a while. If the token's state changes in any way during the observation period, recovery ignores that instance. If the token remains running throughout the observation period, the recovery process re-executes it.
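
The observation step can be sketched as a comparison of two snapshots taken at the start and end of the observation period. The function `find_zombie_tokens`, the token fields (`type`, `status`, `version`), and the type names `userTask` and `timer` are hypothetical; the actual token model in oeWorkflow may differ.

```python
def find_zombie_tokens(before, after, excluded_types=("userTask", "timer")):
    """Return ids of running tokens that did not change between snapshots.

    `before` and `after` map token id -> token state (a dict), captured at
    the start and end of the observation period.
    """
    zombies = []
    for token_id, old in before.items():
        # User Task and Timer tokens legitimately stay running for long.
        if old["type"] in excluded_types or old["status"] != "running":
            continue
        # Any change (or disappearance) means another container is
        # still working on the token, so recovery leaves it alone.
        if after.get(token_id) == old:
            zombies.append(token_id)
    return zombies


before = {
    "t1": {"type": "serviceTask", "status": "running", "version": 3},
    "t2": {"type": "scriptTask", "status": "running", "version": 5},
    "t3": {"type": "userTask", "status": "running", "version": 1},
}
after = {
    "t1": {"type": "serviceTask", "status": "running", "version": 3},
    "t2": {"type": "scriptTask", "status": "running", "version": 6},
    "t3": {"type": "userTask", "status": "running", "version": 1},
}
to_reexecute = find_zombie_tokens(before, after)
```

In this example only `t1` is selected for re-execution: `t2` changed during the observation period, and `t3` is a User Task, which is excluded from the check.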