Architecture¶
This project implements a DBOS durable workflow engine inside Ignition, using Postgres as the system-of-record and a Gateway timer script dispatcher + worker pools to execute work. The end goal is the same as DBOS:
- You can start long-running workflows from decorated functions
- Work can retry on failures
- Operators can observe, control, and intervene (hold/resume/stop, prompts, etc.)
The project also deals with script interpreter reloads (more here).
If you want the detailed thread/claim/lifecycle model, read Concurrency + Lifecycle.
Workflow and steps¶
- Workflows are functions that describe a long-running process.
- Steps are the units of work inside workflows that the library makes durable.
- When a workflow calls a step, library persists the step result so it can:
- replay the result on retry/restart
- avoid re-running the step if it already succeeded
- apply retry/backoff rules in a consistent way
System database¶
- workflow rows (status, inputs, timestamps, ownership)
- step outputs / attempts (call sequence, output/error, timing)
- events + streams (for observability)
- notifications/mailbox (send/recv for async coordination)
- deduplication metadata (idempotent enqueue)
Executors + queues¶
Library executors repeatedly:
- claim some ENQUEUED workflows from a queue
- mark them owned by the executor
- execute them
- update status + heartbeats
- write terminal state when done (SUCCESS/ERROR/CANCELLED)
Workflow Runtime¶
Enqueue¶
(e.i. create durable work)
A workflow request becomes a row in Postgres with status ENQUEUED.
This library has two ways to enqueue workflows.
Normal enqueue¶
Use when you can afford a DB write right now.
- API:
exchange.workflows.api.service.start(...) - Behavior: inserts a workflow as a
ENQUEUEDrow into the database.
flowchart TB
%% Nodes
ui@{ shape: rounded, label: "Perspective/Gateway<br>script" }
DB[(ENQUEUED)]
%% Flow
ui e1@-->|enqueue workflow| DB
%% Edges
e1@{ animation: slow }
Fast enqueue for tag events¶
Use when latency matters (e.g. tag event scripts) waiting on DB I/O.
- API:
exchange.workflows.api.service.enqueueInMemory(...) - Behavior: put request into in-memory queue only and tag change will proceed immediatly and not wait/hog the thread.
- DB write happens on the next gateway timer dispatch.
- Dispatch will pass through all workflow requests from the memory queue into the DB in a batch insert.
flowchart TB
%% Nodes
timer[Gateway Timer script <br><b>dispatch]
inMemQueue@{ shape: das, label: "In-memory Queue" }
tag@{ shape: rounded, label: "Tag value changed" }
DB[(ENQUEUED)]
%% Flow
tag e1@-->|enqueue workflow| inMemQueue
inMemQueue e2@-->|batch pass through| timer
timer e3@-->|enqueue workflow| DB
%% Edges
e1@{ animation: slow }
e2@{ animation: slow }
e3@{ animation: slow }
Dispatch¶
On every gateway timer execution the dispatcher is responsible for a few things:
- flush in-memory queue,
- apply capacity limits,
- claim rows,
- enforce maintenance behavior,
- and run work predictably.
flowchart TB
%% Nodes
timer[Gateway Timer script <br><b>dispatch</b> enqueue]
timerDispatch[Gateway Timer script <br><b>dispatch</b> claim]
inMemQueue@{ shape: das, label: "In-memory Queue" }
tag@{ shape: rounded, label: "Tag value changed" }
ui@{ shape: rounded, label: "Perspective/Gateway<br>script" }
DB[(ENQUEUED)]
subgraph pool [Worker Thread Pool]
thread3@{ shape: procs, label: "Worker Threads" }
end
%% Flow
tag e1@-->|enqueue workflow| inMemQueue
inMemQueue e2@-->|batch pass through| timer
timer e3@-->|enqueue workflow| DB
ui e4@-->|enqueue workflow| DB
DB e5@-->|claim workflow| timerDispatch
timerDispatch e6@-->|dispatch work to available threads| pool
%% Edges
e1@{ animation: slow }
e2@{ animation: slow }
e3@{ animation: slow }
e4@{ animation: slow }
e5@{ animation: slow }
e6@{ animation: slow }
Start execution¶
(e.i. transition to RUNNING when code actually begins)
When a worker thread picks up a claimed row:
- it transitions to
RUNNING - it sets
started_at_epoch_msat that moment - it computes
deadline_epoch_ms = started_at + timeout
Step execution¶
Inside the workflow:
- if ERROR: retry a configurable amount of times
- else: execute and persist output/error
Terminal states¶
When workflow finishes, it writes one terminal status:
- SUCCESS
- ERROR
- CANCELLED
Partition keys¶
(e.i. arbitration to a shared resource)
one workflow at a time per instrument/resource.
- DB: don’t claim if another workflow in that partition is already pending/running
- runtime: keep a local active partition set to avoid double-dispatch in-process
Ignition/Jython interpreter reload issue¶
Ignition adds one special operational hazard DBOS doesn’t have. When you save the project script library in Ignition:
- A new Jython interpreter is created.
- Old running threads are not automatically killed.
- Those old threads keep running old code until they exit.
That can create version-mixing bugs and memory leaks when old interpreter objects stay referenced.
flowchart TD
A[Designer saves project scripts] --> B[New Jython interpreter starts]
A --> C[Old long-running threads keep running]
B --> D[New events use new code]
C --> E[Old code still executing]
D --> F[Risk: mixed behavior & leaked state]
E --> F
This behavior is not a Workflows-specific issue; it is a known Ignition/Jython lifecycle behavior discussed heavily in the forum.
This project is basically built around that reality:
- Keep one workflow runtime per Gateway JVM (not per interpreter).
- Don’t accidentally create new thread pools every time scripts reload.
- Make workflows durable by storing state in Postgres, not in memory.
- Keep tag/event scripts fast (single-digit ms ideally).
- Provide a clean “upgrade / cutover” during production enviroments using maintenance mode.
How we mitigate mixed-version execution¶
We use maintenance controls and generation swap instead of pretending restarts do not happen.
enterMaintenance(mode="drain"): stop claiming new work.- Let in-flight work finish.
swapIfDrained(): swap executors and increment generation when drained.exitMaintenance(): resume normal dispatch.
sequenceDiagram
participant Op as Operator/Admin
participant API as api.admin
participant RT as Runtime
Op->>API: enterMaintenance("drain")
API->>RT: maintenanceEnabled=true
Note over RT: dispatch no longer claims new rows
Op->>API: getMaintenanceStatus()
Op->>API: swapIfDrained()
API->>RT: replace executors, generation++
Op->>API: exitMaintenance()
This does not magically preempt running step code (that is a non-goal). It gives you a controlled cutover model.
References¶
These posts explain the lifecycle/thread behavior this design is built around:
- When are gateway scripts reloaded?
- project save creates new interpreter; running scripts/threads keep old environment.
- Request persistence of an object instance after scripting save
- old Jython objects can persist;
invokeAsynchronousthreads can be a major hazard. - Gateway Script Error: super(type, obj) ...
- clear explanation that save does not kill running threads.
- Garbage collection for scripts
- guidance on long-lived thread shutdown signaling.
- Tag value change event
- why tag event scripts should be single-digit milliseconds.
- Script execution time
- deeper details on tag event thread pools and missed events.
- Managing multiple asynchronous threads
- practical discussions around multi-thread script management in Ignition.