Architecture¶

This project implements a DBOS durable workflow engine inside Ignition, using Postgres as the system-of-record and a Gateway timer script dispatcher + worker pools to execute work. The end goal is the same as DBOS:

You can start long-running workflows from decorated functions
Work can retry on failures
Operators can observe, control, and intervene (hold/resume/stop, prompts, etc.)

The project also deals with script interpreter reloads (more here).

If you want the detailed thread/claim/lifecycle model, read Concurrency + Lifecycle.

Workflow and steps¶

Workflows are functions that describe a long-running process.
Steps are the units of work inside workflows that the library makes durable.
When a workflow calls a step, library persists the step result so it can:
replay the result on retry/restart
avoid re-running the step if it already succeeded
apply retry/backoff rules in a consistent way

System database¶

workflow rows (status, inputs, timestamps, ownership)
step outputs / attempts (call sequence, output/error, timing)
events + streams (for observability)
notifications/mailbox (send/recv for async coordination)
deduplication metadata (idempotent enqueue)

Executors + queues¶

Library executors repeatedly:

claim some ENQUEUED workflows from a queue
mark them owned by the executor
execute them
update status + heartbeats
write terminal state when done (SUCCESS/ERROR/CANCELLED)

Workflow Runtime¶

Enqueue¶

(e.i. create durable work)

A workflow request becomes a row in Postgres with status ENQUEUED.

This library has two ways to enqueue workflows.

Normal enqueue¶

Use when you can afford a DB write right now.

API: exchange.workflows.api.service.start(...)
Behavior: inserts a workflow as a ENQUEUED row into the database.

flowchart TB
  %% Nodes
  ui@{ shape: rounded, label: "Perspective/Gateway<br>script" }
  DB[(ENQUEUED)]

  %% Flow
  ui e1@-->|enqueue workflow| DB

  %% Edges
  e1@{ animation: slow }

Fast enqueue for tag events¶

Use when latency matters (e.g. tag event scripts) waiting on DB I/O.

API: exchange.workflows.api.service.enqueueInMemory(...)
Behavior: put request into in-memory queue only and tag change will proceed immediatly and not wait/hog the thread.
DB write happens on the next gateway timer dispatch.
Dispatch will pass through all workflow requests from the memory queue into the DB in a batch insert.

flowchart TB
  %% Nodes
  timer[Gateway Timer script <br><b>dispatch]
  inMemQueue@{ shape: das, label: "In-memory Queue" }
  tag@{ shape: rounded, label: "Tag value changed" }
  DB[(ENQUEUED)]

  %% Flow
  tag e1@-->|enqueue workflow| inMemQueue
  inMemQueue e2@-->|batch pass through| timer
  timer e3@-->|enqueue workflow| DB

  %% Edges
  e1@{ animation: slow }
  e2@{ animation: slow }
  e3@{ animation: slow }

Dispatch¶

On every gateway timer execution the dispatcher is responsible for a few things:

flush in-memory queue,
apply capacity limits,
claim rows,
enforce maintenance behavior,
and run work predictably.

flowchart TB
  %% Nodes
  timer[Gateway Timer script <br><b>dispatch</b> enqueue]
  timerDispatch[Gateway Timer script <br><b>dispatch</b> claim]
  inMemQueue@{ shape: das, label: "In-memory Queue" }
  tag@{ shape: rounded, label: "Tag value changed" }
  ui@{ shape: rounded, label: "Perspective/Gateway<br>script" }

  DB[(ENQUEUED)]

  subgraph pool [Worker Thread Pool]
  thread3@{ shape: procs, label: "Worker Threads" }
  end

  %% Flow
  tag e1@-->|enqueue workflow| inMemQueue
  inMemQueue e2@-->|batch pass through| timer
  timer e3@-->|enqueue workflow| DB
  ui e4@-->|enqueue workflow| DB
  DB e5@-->|claim workflow| timerDispatch
  timerDispatch e6@-->|dispatch work to available threads| pool

  %% Edges
  e1@{ animation: slow }
  e2@{ animation: slow }
  e3@{ animation: slow }
  e4@{ animation: slow }
  e5@{ animation: slow }
  e6@{ animation: slow }

Start execution¶

(e.i. transition to RUNNING when code actually begins)

When a worker thread picks up a claimed row:

it transitions to RUNNING
it sets started_at_epoch_ms at that moment
it computes deadline_epoch_ms = started_at + timeout

Step execution¶

Inside the workflow:

if ERROR: retry a configurable amount of times
else: execute and persist output/error

Terminal states¶

When workflow finishes, it writes one terminal status:

SUCCESS
ERROR
CANCELLED

Partition keys¶

(e.i. arbitration to a shared resource)

one workflow at a time per instrument/resource.

DB: don’t claim if another workflow in that partition is already pending/running
runtime: keep a local active partition set to avoid double-dispatch in-process

Ignition/Jython interpreter reload issue¶

Ignition adds one special operational hazard DBOS doesn’t have. When you save the project script library in Ignition:

A new Jython interpreter is created.
Old running threads are not automatically killed.
Those old threads keep running old code until they exit.

That can create version-mixing bugs and memory leaks when old interpreter objects stay referenced.

flowchart TD
  A[Designer saves project scripts] --> B[New Jython interpreter starts]
  A --> C[Old long-running threads keep running]
  B --> D[New events use new code]
  C --> E[Old code still executing]
  D --> F[Risk: mixed behavior & leaked state]
  E --> F

This behavior is not a Workflows-specific issue; it is a known Ignition/Jython lifecycle behavior discussed heavily in the forum.

This project is basically built around that reality:

Keep one workflow runtime per Gateway JVM (not per interpreter).
Don’t accidentally create new thread pools every time scripts reload.
Make workflows durable by storing state in Postgres, not in memory.
Keep tag/event scripts fast (single-digit ms ideally).
Provide a clean “upgrade / cutover” during production enviroments using maintenance mode.

How we mitigate mixed-version execution¶

We use maintenance controls and generation swap instead of pretending restarts do not happen.

enterMaintenance(mode="drain"): stop claiming new work.
Let in-flight work finish.
swapIfDrained(): swap executors and increment generation when drained.
exitMaintenance(): resume normal dispatch.

sequenceDiagram
  participant Op as Operator/Admin
  participant API as api.admin
  participant RT as Runtime

  Op->>API: enterMaintenance("drain")
  API->>RT: maintenanceEnabled=true
  Note over RT: dispatch no longer claims new rows
  Op->>API: getMaintenanceStatus()
  Op->>API: swapIfDrained()
  API->>RT: replace executors, generation++
  Op->>API: exitMaintenance()

This does not magically preempt running step code (that is a non-goal). It gives you a controlled cutover model.

References¶

These posts explain the lifecycle/thread behavior this design is built around:

When are gateway scripts reloaded?
project save creates new interpreter; running scripts/threads keep old environment.
Request persistence of an object instance after scripting save
old Jython objects can persist; invokeAsynchronous threads can be a major hazard.
Gateway Script Error: super(type, obj) ...
clear explanation that save does not kill running threads.
Garbage collection for scripts
guidance on long-lived thread shutdown signaling.
Tag value change event
why tag event scripts should be single-digit milliseconds.
Script execution time
deeper details on tag event thread pools and missed events.
Managing multiple asynchronous threads
practical discussions around multi-thread script management in Ignition.