Workflow Engine

Architecture of the AEGIS workflow FSM, Temporal integration, Blackboard system, and workflow execution lifecycle.

The AEGIS workflow engine executes declarative finite-state machine (FSM) workflows defined in YAML. It uses Temporal as the durable execution backend — Temporal guarantees that workflow state survives orchestrator restarts, container crashes, and network partitions.


Architecture Overview

The workflow engine spans two components: the Rust orchestrator handles registration, state persistence, and event sourcing; the TypeScript worker owns FSM interpretation and delegates all agent execution back to the Rust runtime via gRPC.

              ┌─────────────Rust Orchestrator──────────────┐
              │                                            │
       WorkflowParser          WorkflowRepository          │
       (YAML → domain)         (PostgreSQL)                │
              │                                            │
              └──────── StartWorkflowUseCase ──────────────┘

               ┌──────────────────┴───────────────────┐
               │                                      │
   HTTP POST /register-workflow          gRPC StartWorkflowExecution
               │                                      │
               ▼                                      ▼
     TypeScript Worker (:3000)           Temporal Server (:7233)
   workflow_definitions (PostgreSQL)              │
               │                         task queue: aegis-agents
               └──────────────────────────────────┘

                          aegis_workflow()
                       (single generic FSM interpreter)

         ┌────────────────────────┼────────────────────────┐
         │                        │                        │
 executeAgentActivity  executeSystemCommandActivity  validateOutputActivity
         │                        │                        │
         └────────────────────────┴────────────────────────┘

                   gRPC AegisRuntime (:50051)  ←── Rust Orchestrator
                   ExecuteAgent (streaming)
                   ExecuteSystemCommand (unary)
                   ValidateWithJudges (unary)

                HTTP POST /v1/temporal-events  ──► Rust EventBus
                (WorkflowExecution event store)

Temporal Integration

Temporal is used for durable workflow execution only. AEGIS does not expose Temporal concepts (workflows, activities, signals) in its public API or manifest format. Temporal is an infrastructure concern behind an Anti-Corruption Layer (ACL).

The AEGIS Workflow manifest maps to Temporal as follows:

| AEGIS Concept | Temporal Concept |
| --- | --- |
| Workflow | Temporal Workflow Definition |
| WorkflowExecution | Temporal Workflow Run |
| WorkflowState (Agent) | Temporal Activity |
| WorkflowState (System) | Temporal Activity |
| WorkflowState (Human) | Temporal Signal Handler |
| WorkflowState (ParallelAgents) | Parallel Temporal Activities + aggregation |
| WorkflowState (ContainerRun) | Temporal Activity (executeContainerRunActivity) |
| WorkflowState (ParallelContainerRun) | Parallel Temporal Activities (executeContainerRunActivity × N) |
| WorkflowState (Subworkflow) | Temporal Child Workflow Execution |
| Blackboard | Temporal Workflow State (persisted in Temporal's event history) |
| TransitionRule | Conditional logic within the Temporal workflow function |

Durability Benefits

Because Temporal persists workflow event history:

  • If the orchestrator crashes during a WorkflowState.Agent (e.g., while the agent container is running), the workflow resumes from the last committed state when the orchestrator restarts.
  • Human states can wait indefinitely for signals without consuming memory or CPU.
  • Workflow executions running for hours or days are fully supported.

Generic Interpreter Pattern

The TypeScript worker registers a single Temporal workflow function — aegis_workflow — on the aegis-agents task queue. Every AEGIS workflow, regardless of how many states it defines, executes through this one function. When the workflow starts, aegis_workflow calls the fetchWorkflowDefinition activity to load the FSM definition from PostgreSQL, initialises the Blackboard from the workflow's spec.context and any start-input, then runs the FSM loop until a terminal state (no transitions) is reached or the iteration safety bound of 1000 steps is hit.

This means deploying a new workflow YAML requires no TypeScript code changes and no new Temporal workflow function. The manifest is registered to the workflow_definitions table and becomes immediately executable by any running worker replica.
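The loop described above can be sketched roughly as follows. This is a simplified, synchronous illustration rather than the worker's actual code: the real interpreter runs inside Temporal and dispatches activities asynchronously. The names runFsm, WorkflowDef, and executeState are hypothetical, only on_success / on_failure / unconditional transitions are modelled, and both the shallow merge of the start input and the first-match evaluation order are assumptions.

```typescript
// Hypothetical types for illustration; not the worker's real API.
type Blackboard = Record<string, unknown>;

interface StateResult {
  status: "success" | "failed";
  output?: unknown;
}

interface Transition {
  condition?: "on_success" | "on_failure"; // omitted = unconditional
  target: string;
}

interface StateDef {
  transitions: Transition[];
}

interface WorkflowDef {
  initialState: string;
  context: Blackboard; // spec.context defaults
  states: Record<string, StateDef>;
}

const MAX_STEPS = 1000; // iteration safety bound stated in the text

function runFsm(
  def: WorkflowDef,
  startInput: Blackboard,
  executeState: (name: string, bb: Blackboard) => StateResult
): { finalState: string; steps: number; blackboard: Blackboard } {
  // Initialise the Blackboard from spec.context, then overlay the start input
  // (shallow merge is an assumption).
  const blackboard: Blackboard = { ...def.context, ...startInput };
  let current = def.initialState;
  let steps = 0;

  while (steps < MAX_STEPS) {
    const state = def.states[current];
    if (state === undefined) {
      throw new Error("unknown state: " + current);
    }
    const result = executeState(current, blackboard);
    blackboard[current] = result; // state output is stored under the state's name
    steps += 1;

    // A state with no transitions is terminal.
    if (state.transitions.length === 0) {
      return { finalState: current, steps, blackboard };
    }

    // First matching transition wins (assumed evaluation order).
    let next: string | undefined = undefined;
    for (const t of state.transitions) {
      const matches =
        t.condition === undefined ||
        (t.condition === "on_success" && result.status === "success") ||
        (t.condition === "on_failure" && result.status === "failed");
      if (matches) {
        next = t.target;
        break;
      }
    }
    if (next === undefined) {
      throw new Error("no transition matched in state: " + current);
    }
    current = next;
  }
  throw new Error("exceeded " + MAX_STEPS + " steps");
}
```

Because the definition is plain data, adding a workflow only means adding another WorkflowDef value; the loop itself does not change, which is the point of the generic interpreter pattern.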


Domain Model

struct Workflow {
    id: WorkflowId,
    metadata: WorkflowMetadata,   // name, version, labels
    scope: WorkflowScope,         // Global | Tenant | User
    spec: WorkflowSpec,
    created_at: DateTime<Utc>,
}

struct WorkflowSpec {
    initial_state: String,                      // entry state name
    context: HashMap<String, Value>,            // default Blackboard values (spec.context)
    states: HashMap<String, WorkflowState>,
}

// The Workflow aggregate includes a `scope` field (WorkflowScope) that controls visibility.
// Scope can be Global (platform-wide, tenant_id = "aegis-system"), Tenant (default, visible
// to all users in the tenant), or User (private to the owning user). Name-based lookups
// resolve using narrowest-first precedence: User → Tenant → Global.
enum WorkflowScope {
    Global,   // tenant_id = "aegis-system"; visible to all tenants
    Tenant,   // visible to all users within the owning tenant (default)
    User,     // visible only to the owning user within their tenant
}

struct WorkflowState {
    kind: StateKind,
    timeout: Option<Duration>,
    transitions: Vec<TransitionRule>,
}

// All state-specific config is carried inside the StateKind variant.
enum StateKind {
    Agent {
        agent: String,                          // deployed agent name
        input: String,                          // Handlebars input template
        isolation: Option<IsolationMode>,
    },
    System {
        command: String,                        // shell command (Handlebars-rendered)
        env: HashMap<String, String>,
        workdir: Option<String>,
    },
    Human {
        prompt: String,                         // shown to operator (Handlebars-rendered)
        default_response: Option<String>,       // used when timeout elapses with no signal
    },
    ParallelAgents {
        agents: Vec<ParallelAgentConfig>,       // concurrent agent configs
        consensus: ConsensusConfig,             // aggregation strategy
    },
    ContainerRun {
        name: Option<String>,                   // human-readable label for Synapse UI
        image: String,                          // registry/repo:tag or standard runtime reference
        image_pull_policy: ImagePullPolicy,     // Always | IfNotPresent | Never
        command: Vec<String>,                   // exec-form command; joined under sh -c if shell=true
        shell: bool,                            // wrap in /bin/sh -c
        env: HashMap<String, String>,
        workdir: Option<String>,
        volumes: Vec<ContainerVolumeMount>,
        resources: Option<ContainerResources>,  // cpu millicores, memory, timeout
        registry_credentials: Option<String>,   // OpenBao path resolved by orchestrator
        retry: Option<RetryConfig>,
    },
    ParallelContainerRun {
        steps: Vec<ContainerRunConfig>,         // concurrent container steps
        completion: ParallelCompletionStrategy, // AllSucceed | AnySucceed | BestEffort
    },
    Subworkflow {
        workflow_id: String,                    // deployed child workflow name or UUID
        mode: SubworkflowMode,                  // Blocking | FireAndForget
        result_key: Option<String>,             // Blackboard key for child's final output (blocking)
        input: Option<String>,                  // Handlebars template passed as child's start input
    },
}

enum SubworkflowMode {
    Blocking,       // parent waits for child completion
    FireAndForget,  // parent continues immediately
}

struct TransitionRule {
    condition: Option<TransitionCondition>,     // None = unconditional (always)
    target: String,                             // target state name
    feedback: Option<String>,                   // Handlebars string; available as {{state.feedback}}
}

enum TransitionCondition {
    Always,
    OnSuccess, OnFailure,
    ScoreAbove { threshold: f64 },
    ScoreBelow { threshold: f64 },
    ScoreBetween { min: f64, max: f64 },
    ScoreAndConfidenceAbove { score: f64, confidence: f64 },
    ConfidenceAbove { threshold: f64 },
    Consensus { threshold: f64, agreement: f64 },
    AllApproved, AnyRejected,
    InputEquals { value: String },
    InputEqualsYes, InputEqualsNo,
    ExitCodeZero, ExitCodeNonZero,
    ExitCode { code: i32 },
    Custom { expression: String },              // Handlebars boolean expression
}
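The narrowest-first scope precedence described in the WorkflowScope comments (User → Tenant → Global) could be resolved along these lines. This is a TypeScript sketch for illustration only; the actual lookup lives in the Rust orchestrator and is not shown in this document. WorkflowRecord, ownerUserId, and resolveWorkflow are hypothetical names.

```typescript
type WorkflowScope = "Global" | "Tenant" | "User";

// Hypothetical flattened record; the real aggregate is the Rust Workflow struct.
interface WorkflowRecord {
  name: string;
  scope: WorkflowScope;
  tenantId: string;     // "aegis-system" for Global workflows
  ownerUserId?: string; // assumed field, set for User-scoped workflows
}

const GLOBAL_TENANT = "aegis-system";

function resolveWorkflow(
  all: WorkflowRecord[],
  name: string,
  tenantId: string,
  userId: string
): WorkflowRecord | undefined {
  const candidates = all.filter((w) => w.name === name);

  // 1. User scope: private to the owning user within their tenant.
  const userMatch = candidates.filter(
    (w) => w.scope === "User" && w.tenantId === tenantId && w.ownerUserId === userId
  )[0];
  if (userMatch !== undefined) return userMatch;

  // 2. Tenant scope: visible to all users in the tenant.
  const tenantMatch = candidates.filter(
    (w) => w.scope === "Tenant" && w.tenantId === tenantId
  )[0];
  if (tenantMatch !== undefined) return tenantMatch;

  // 3. Global scope: platform-wide fallback.
  return candidates.filter(
    (w) => w.scope === "Global" && w.tenantId === GLOBAL_TENANT
  )[0];
}
```

With this ordering, a user's private workflow shadows a tenant workflow of the same name, which in turn shadows a platform-wide one.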

gRPC Bridge

The TypeScript worker never executes agent code directly. All compute-heavy work is delegated back to the Rust orchestrator via the AegisRuntime gRPC service:

| Activity | gRPC Method | Call Type | Purpose |
| --- | --- | --- | --- |
| executeAgentActivity | AegisRuntime.ExecuteAgent | server-streaming | Runs a full 100monkeys iterative execution; collects all ExecutionEvent stream messages before returning. |
| executeSystemCommandActivity | AegisRuntime.ExecuteSystemCommand | unary | Runs a shell command on the orchestrator host. |
| validateOutputActivity | AegisRuntime.ValidateWithJudges | unary | Runs gradient validation against configured judge agents; returns a score (0.0–1.0) and confidence. |
| executeParallelAgentsActivity | AegisRuntime.ExecuteAgent × N, then ValidateWithJudges | parallel + unary | Fans out to N concurrent ExecuteAgent calls via Promise.all, then computes consensus. |
| executeContainerRunActivity | AegisRuntime.ExecuteContainerRun | unary | Creates a container, mounts volumes via NFS gateway, runs the command, captures stdout/stderr/exit_code, destroys the container. No LLM, no bootstrap, no iteration loop. |
| executeParallelContainerRunActivity | AegisRuntime.ExecuteContainerRun × N | Promise.allSettled + aggregation | Dispatches N concurrent executeContainerRunActivity calls; aggregates results per ParallelCompletionStrategy. |
| executeSubworkflowActivity | Temporal Child Workflow API | child workflow | Starts a child workflow execution via Temporal's native child workflow support. In blocking mode, awaits the child's completion and returns the final Blackboard. In fire_and_forget mode, starts the child and returns immediately with the child execution ID. |

The gRPC server runs on port 50051 (configurable via AEGIS_RUNTIME_GRPC_URL). Security policy enforcement, SEAL attestation, and container lifecycle remain entirely in Rust — the TypeScript worker has no direct access to agent containers or the Docker socket.

After each state transition, the worker sends a structured event to POST /v1/temporal-events on the Rust orchestrator. The TemporalEventListener maps these events to typed WorkflowEvent domain objects, appends them to the WorkflowExecution event store in PostgreSQL, and publishes them to the internal EventBus.


Blackboard System

The Blackboard is a HashMap<String, serde_json::Value> that carries shared context across all states in a workflow execution. It is owned by the Temporal worker during live execution and persisted by Temporal's event-sourcing.

Ownership Lifecycle

The Blackboard moves through three distinct phases:

| Phase | Owner | What happens |
| --- | --- | --- |
| Seed | Rust orchestrator | spec.context keys and any values from StartWorkflowExecutionRequest.blackboard are written into WorkflowExecution.blackboard before the Temporal worker starts. Accessible in manifests as {{workflow.context.KEY}} or {{blackboard.KEY}}. |
| Live execution | TypeScript Temporal worker | The FSM runs. Each state's output is written to the Blackboard under the state's name (e.g. REQUIREMENTS.output, REQUIREMENTS.status, REQUIREMENTS.score). Handlebars templates in input, command, env, prompt, and feedback fields are resolved against the live Blackboard. |
| Capture | Rust orchestrator | On WorkflowExecutionCompleted, the final_blackboard snapshot from Temporal is merged back into WorkflowExecution.blackboard and persisted to PostgreSQL. A WorkflowCompleted domain event carrying final_blackboard is published to the event bus. |

The Rust orchestrator never mutates the Blackboard during live execution — only the Temporal worker does.
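The Seed and Capture phases both come down to merging key-value maps. The sketch below illustrates that in TypeScript for consistency with the worker examples, although these two phases run in the Rust orchestrator. The function names are hypothetical, and the shallow merge with "later source wins" is an assumption; the document does not specify how conflicts are resolved.

```typescript
type Blackboard = Record<string, unknown>;

// Seed phase: spec.context defaults first, then values from the start request.
// Letting the start request override defaults is an assumption.
function seedBlackboard(specContext: Blackboard, startValues: Blackboard): Blackboard {
  return { ...specContext, ...startValues };
}

// Capture phase: merge the final snapshot reported by Temporal back into the
// persisted Blackboard. Inputs are left unmodified.
function captureBlackboard(persisted: Blackboard, finalSnapshot: Blackboard): Blackboard {
  return { ...persisted, ...finalSnapshot };
}
```

Between the two calls the orchestrator's copy is never touched, matching the rule that only the Temporal worker mutates the Blackboard during live execution.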

State Output Conventions

When a state completes, the Temporal worker writes the execution result to the Blackboard under the state's name. The shape differs by state kind:

Agent state:

Blackboard["REQUIREMENTS"] = {
  "status": "success",
  "output": "...agent's final output text...",
  "score": 0.92,
  "iterations": 2
}

ContainerRun state:

Blackboard["BUILD"] = {
  "exit_code": 0,
  "stdout": "Compiling my-crate v0.1.0\n   Finished release in 12.3s",
  "stderr": "",
  "duration_ms": 12350
}

ParallelContainerRun state (keyed by step name):

Blackboard["TEST"] = {
  "unit-tests":   { "exit_code": 0, "stdout": "...", "stderr": "",    "duration_ms": 4200 },
  "lint":         { "exit_code": 0, "stdout": "",    "stderr": "",    "duration_ms": 870  },
  "format-check": { "exit_code": 1, "stdout": "",    "stderr": "...", "duration_ms": 410  }
}

Downstream states reference these values via Handlebars:

ARCHITECTURE:
  kind: Agent
  agent: architect-agent
  input: |
    Requirements: {{REQUIREMENTS.output}}
    
    Design the architecture.
  transitions:
    - condition: on_success
      target: ARCH_APPROVAL
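To make the lookup behind {{REQUIREMENTS.output}} concrete, here is a deliberately simplified placeholder resolver. It is not the Handlebars library and supports only dotted paths (no helpers, blocks, or escaping); lookupPath and renderTemplate are hypothetical names used for illustration.

```typescript
type Blackboard = Record<string, unknown>;

// Walk a dotted path such as "REQUIREMENTS.output" through nested objects.
// Array indices work too, e.g. "PARALLEL_REVIEW.agents.0.output".
function lookupPath(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Replace each {{path}} with the value found on the Blackboard.
// In this sketch, missing paths render as an empty string and
// non-string values are rendered as JSON.
function renderTemplate(template: string, blackboard: Blackboard): string {
  return template.replace(
    /\{\{\s*([\w.-]+)\s*\}\}/g,
    (_match: string, path: string): string => {
      const value = lookupPath(blackboard, path);
      if (value === undefined || value === null) return "";
      return typeof value === "string" ? value : JSON.stringify(value);
    }
  );
}
```

For example, rendering "Requirements: {{REQUIREMENTS.output}}" against a Blackboard whose REQUIREMENTS entry has output "Build a CLI" yields "Requirements: Build a CLI".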

Template Variables

Handlebars templates are rendered against the live Blackboard at the moment each state executes. The full variable reference is in the Workflow Manifest Reference.

Key variables:

| Variable | Description |
| --- | --- |
| {{STATE.output}} | Final output text of an Agent state. |
| {{STATE.output.stdout}} | stdout of a System or ContainerRun state. |
| {{STATE.output.stderr}} | stderr of a System or ContainerRun state. |
| {{STATE.output.exit_code}} | Exit code of a System or ContainerRun state. |
| {{STATE.output.duration_ms}} | Wall-clock duration in ms (ContainerRun states). |
| {{STATE.output.STEPNAME.stdout}} | Per-step stdout (ParallelContainerRun states). |
| {{STATE.output.STEPNAME.stderr}} | Per-step stderr (ParallelContainerRun states). |
| {{STATE.output.STEPNAME.exit_code}} | Per-step exit code (ParallelContainerRun states). |
| {{STATE.status}} | "success" or "failed". |
| {{STATE.score}} | Validation score of an Agent state (0.0–1.0). |
| {{STATE.consensus.score}} | Aggregated score of a ParallelAgents state. |
| {{STATE.agents.N.output}} | Individual judge output (ParallelAgents, 0-indexed). |
| {{workflow.context.KEY}} | Value from spec.context. |
| {{blackboard.KEY}} | Any key set in spec.context or by a prior System state. |
| {{state.feedback}} | feedback string from the incoming transition. |
| {{input.KEY}} | Key from the workflow start --input payload. |

Human State Lifecycle

A WorkflowState.Human suspends the Temporal workflow run and waits for an external signal:

WorkflowExecution
  state = "approve_requirements"
  status = WAITING_FOR_SIGNAL

         │  Signal arrives via:
         │    POST /v1/workflows/executions/{id}/signal
         │    {"response": "approved"}
         │  or via the human-approvals endpoint:
         │    POST /v1/human-approvals/{id}/approve

  Temporal signal received → Blackboard["approve_requirements"]["decision"] = "approved"
  TransitionRule evaluates → target = "implement"
  Workflow continues

Human states respect timeout_secs. If no signal arrives within the timeout, the state's default_response (when set) is used and the workflow evaluates its transitions. Typically the last (unconditional) transition leads to a failed state.
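The timeout behaviour can be sketched as two small steps: pick the effective response (the signal if one arrived, otherwise default_response), then take the first transition that matches. The condition spellings input_equals, input_equals_yes, and input_equals_no are assumed snake_case forms of the TransitionCondition variants, exact string comparison against "yes" / "no" is an assumption, and the function names are hypothetical.

```typescript
// Assumed manifest spellings of InputEquals, InputEqualsYes, InputEqualsNo.
type HumanCondition =
  | { kind: "input_equals"; value: string }
  | { kind: "input_equals_yes" }
  | { kind: "input_equals_no" };

interface HumanTransition {
  condition?: HumanCondition; // omitted = unconditional
  target: string;
}

// The signal wins if present; otherwise fall back to default_response,
// which may itself be absent.
function effectiveResponse(
  signal: string | undefined,
  defaultResponse: string | undefined
): string | undefined {
  return signal !== undefined ? signal : defaultResponse;
}

// First matching transition wins; an unconditional rule acts as the fallback.
function nextStateAfterHuman(
  transitions: HumanTransition[],
  response: string | undefined
): string | undefined {
  for (const t of transitions) {
    const c = t.condition;
    if (c === undefined) return t.target;
    if (response === undefined) continue; // conditional rules need a response
    if (c.kind === "input_equals" && response === c.value) return t.target;
    if (c.kind === "input_equals_yes" && response === "yes") return t.target;
    if (c.kind === "input_equals_no" && response === "no") return t.target;
  }
  return undefined;
}
```

With transitions [input_equals "approved" → implement, unconditional → failed], a timeout with no default_response falls through to failed, which is the typical pattern the paragraph above describes.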


ContainerRun State

When a WorkflowState.ContainerRun is entered, the Temporal worker calls executeContainerRunActivity. This activity calls the Rust orchestrator via AegisRuntime.ExecuteContainerRun gRPC — the orchestrator creates the container, mounts volumes through the NFS Server Gateway, runs the command, captures output, destroys the container, and returns the result. There is no LLM, no bootstrap.py, and no 100monkeys iteration loop.

Enter BUILD (ContainerRun)

├── executeContainerRunActivity(image, command, volumes, resources)
│     │
│     ├── gRPC AegisRuntime.ExecuteContainerRun
│     │     ├── Pull image (respects image_pull_policy)
│     │     ├── docker run --rm -v <NFS-mount>:/workspace --cpus 2 --memory 4g
│     │     │     └── cargo build --release
│     │     ├── Capture stdout + stderr (up to 1 MB each)
│     │     └── Return { exit_code, stdout, stderr, duration_ms }
│     │
│     └── Blackboard["BUILD"] = { exit_code, stdout, stderr, duration_ms }

TransitionRules evaluated on exit_code → next state chosen
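The final step, choosing the next state from exit_code, might look like the sketch below. It models only the ExitCodeZero, ExitCodeNonZero, and ExitCode { code } variants of TransitionCondition; the snake_case kind strings and the function names are assumptions made for illustration, as is first-match evaluation order.

```typescript
interface ContainerRunResult {
  exit_code: number;
  stdout: string;
  stderr: string;
  duration_ms: number;
}

// Assumed spellings of ExitCodeZero, ExitCodeNonZero, ExitCode { code }.
type ExitCondition =
  | { kind: "exit_code_zero" }
  | { kind: "exit_code_non_zero" }
  | { kind: "exit_code"; code: number };

interface ExitTransition {
  condition?: ExitCondition; // omitted = unconditional
  target: string;
}

function matchesExit(c: ExitCondition, result: ContainerRunResult): boolean {
  if (c.kind === "exit_code_zero") return result.exit_code === 0;
  if (c.kind === "exit_code_non_zero") return result.exit_code !== 0;
  return result.exit_code === c.code; // remaining variant: exact exit code
}

// Returns the target of the first matching rule, or undefined if none match.
function nextStateAfterContainerRun(
  transitions: ExitTransition[],
  result: ContainerRunResult
): string | undefined {
  for (const t of transitions) {
    if (t.condition === undefined || matchesExit(t.condition, result)) {
      return t.target;
    }
  }
  return undefined;
}
```

Placing a specific exit_code rule before the generic non-zero rule lets a workflow route one known failure code (for example a retryable error) differently from all other failures.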

For ParallelContainerRun, the worker fans out with Promise.allSettled and aggregates per ParallelCompletionStrategy:

Enter TEST (ParallelContainerRun, completion: all_succeed)

├── executeContainerRunActivity(unit-tests, ...)
├── executeContainerRunActivity(lint, ...)        (all three run simultaneously)
└── executeContainerRunActivity(format-check, ...)

         └── Promise.allSettled — wait for all three

               completion: all_succeed
                 → all exit_code == 0?  → on_success transition
                 → any exit_code != 0?  → on_failure transition

         Blackboard["TEST"] = {
           "unit-tests":   { exit_code, stdout, stderr, duration_ms },
           "lint":         { exit_code, stdout, stderr, duration_ms },
           "format-check": { exit_code, stdout, stderr, duration_ms }
         }
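The aggregation step can be illustrated with a pure function over the per-step results. The all_succeed behaviour follows the diagram above; any_succeed is inferred from the strategy's name, and treating best_effort as always successful is an assumption, since the document does not define it. aggregateSteps is a hypothetical name.

```typescript
type ParallelCompletionStrategy = "all_succeed" | "any_succeed" | "best_effort";

interface StepResult {
  exit_code: number;
  stdout: string;
  stderr: string;
  duration_ms: number;
}

// Decide the overall state status from the per-step results,
// keyed by step name as they appear on the Blackboard.
function aggregateSteps(
  steps: Record<string, StepResult>,
  strategy: ParallelCompletionStrategy
): "success" | "failed" {
  const names = Object.keys(steps);
  const passed = names.filter((name) => {
    const step = steps[name];
    return step !== undefined && step.exit_code === 0;
  }).length;

  if (strategy === "all_succeed") {
    return passed === names.length ? "success" : "failed";
  }
  if (strategy === "any_succeed") {
    return passed > 0 ? "success" : "failed";
  }
  // best_effort: assumed to succeed regardless of individual step results.
  return "success";
}
```

Applied to the TEST example above (format-check exits with 1), all_succeed yields "failed" while any_succeed yields "success".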

ParallelAgents State

When a WorkflowState.ParallelAgents is entered, the Temporal workflow dispatches all listed agent IDs concurrently as Temporal Activities:

Enter PARALLEL_REVIEW

├── Activity: security-reviewer
├── Activity: performance-reviewer  (all three run simultaneously)
└── Activity: style-reviewer

         └── All three complete

               Blackboard["PARALLEL_REVIEW"] = {
                 "consensus": {
                   "score": 0.91,
                   "confidence": 0.87,
                   "all_succeeded": true
                 },
                 "agents": [
                   {"status": "success", "score": 0.91, "output": "{...}"},
                   {"status": "success", "score": 0.88, "output": "{...}"},
                   {"status": "success", "score": 0.95, "output": "{...}"}
                 ]
               }

         TransitionRules evaluated → next state chosen

If any single agent fails, all_succeeded is set to false. The default (unconditional) transition should lead to the failure-handling state.
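A rough sketch of how the consensus block could be assembled from individual results follows. The real aggregation strategy is determined by ConsensusConfig and is not specified here; this example assumes a plain arithmetic mean for the score and omits confidence because its derivation is not described. summarizeAgents is a hypothetical name.

```typescript
interface AgentResult {
  status: "success" | "failed";
  score: number;
  output: string;
}

interface ParallelAgentsEntry {
  consensus: { score: number; all_succeeded: boolean };
  agents: AgentResult[];
}

function summarizeAgents(agents: AgentResult[]): ParallelAgentsEntry {
  // all_succeeded flips to false as soon as any single agent fails.
  const allSucceeded = agents.every((a) => a.status === "success");

  // Assumed aggregation: arithmetic mean of the individual scores.
  const total = agents.reduce((sum, a) => sum + a.score, 0);
  const score = agents.length > 0 ? total / agents.length : 0;

  return { consensus: { score, all_succeeded: allSucceeded }, agents };
}
```

For the three example scores above (0.91, 0.88, 0.95) the mean is about 0.913, which rounds to the 0.91 shown in the diagram, though that agreement alone does not confirm the strategy the engine uses.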


Workflow Composition

The Subworkflow state kind enables parent-child workflow composition. When a workflow state with kind: Subworkflow is entered, the Temporal worker starts a child workflow execution using Temporal's native child workflow API rather than a gRPC activity call.

How It Works

Enter TRIGGER_CHILD (Subworkflow, mode: blocking)

├── Temporal startChildWorkflowExecution(workflow_id, input)
│     │
│     ├── Child workflow starts as an independent Temporal workflow run
│     ├── Child has its own Blackboard, state machine, and execution history
│     ├── Child runs the same aegis_workflow() generic interpreter
│     │
│     └── Parent waits for child WorkflowExecutionCompleted signal
│           │
│           └── Child's final_blackboard returned to parent

├── Blackboard["TRIGGER_CHILD"] = {
│     "status": "success",
│     "result": <child's final output>,
│     "child_execution_id": "wfx-child-..."
│   }
├── Blackboard["pipeline_output"] = <child's final_blackboard>  (via result_key)

TransitionRules evaluated → next state chosen

In fire_and_forget mode, the parent does not wait:

Enter START_BACKGROUND (Subworkflow, mode: fire_and_forget)

├── Temporal startChildWorkflowExecution(workflow_id, input)
│     └── Child starts; parent does NOT await completion

├── Blackboard["START_BACKGROUND"] = {
│     "status": "success",
│     "child_execution_id": "wfx-child-..."
│   }

TransitionRules evaluated → parent continues immediately
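The difference between the two modes shows up in what the parent writes to its own Blackboard. The sketch below mirrors the two diagrams above as a pure function; it assumes the child has already been started (and, in blocking mode, has completed successfully). ChildOutcome and applySubworkflowResult are hypothetical names, not part of the worker's API.

```typescript
type Blackboard = Record<string, unknown>;
type SubworkflowMode = "blocking" | "fire_and_forget";

// Hypothetical summary of what the parent knows about the child.
interface ChildOutcome {
  childExecutionId: string;
  finalOutput?: unknown;        // only known in blocking mode
  finalBlackboard?: Blackboard; // only known in blocking mode
}

function applySubworkflowResult(
  parent: Blackboard,
  stateName: string,
  mode: SubworkflowMode,
  resultKey: string | undefined,
  child: ChildOutcome
): Blackboard {
  const next: Blackboard = { ...parent };

  if (mode === "fire_and_forget") {
    // Parent records only that the child was started.
    next[stateName] = {
      status: "success",
      child_execution_id: child.childExecutionId,
    };
    return next;
  }

  // Blocking: record the child's result under the state name...
  next[stateName] = {
    status: "success",
    result: child.finalOutput,
    child_execution_id: child.childExecutionId,
  };
  // ...and, if result_key is set, expose the child's final Blackboard there.
  if (resultKey !== undefined) {
    next[resultKey] = child.finalBlackboard;
  }
  return next;
}
```

Note that the parent only ever receives a copy of the child's final Blackboard; nothing in this flow lets the child read or write the parent's Blackboard while running.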

Design Principles

  • Isolation: The child workflow has its own Blackboard. It cannot read or write the parent's Blackboard directly. Data flows in via input and out via the child's final_blackboard.
  • Durability: Temporal's child workflow mechanism ensures that if the parent crashes, the child either completes or is cancelled according to the configured ParentClosePolicy. By default, children are allowed to run to completion.
  • Reusability: Any deployed workflow can be invoked as a child. The child does not need to know it is being called from a parent — it is a standard workflow.
  • Observability: Child workflow executions appear as independent runs in the Temporal UI and in the AEGIS execution history API. The parent-child relationship is tracked via WorkflowChildStarted and WorkflowChildCompleted domain events.

Workflow Execution Events

The workflow engine publishes to the event bus:

| Event | Trigger |
| --- | --- |
| WorkflowStarted | StartWorkflowUseCase completes |
| WorkflowStateEntered | Temporal activity begins |
| WorkflowStateCompleted | Temporal activity succeeds |
| WorkflowStateFailed | Temporal activity fails or times out |
| WorkflowSignalReceived | Human state receives signal |
| WorkflowCompleted | Reaches a terminal state with no transitions |
| WorkflowChildStarted | Subworkflow state starts a child workflow execution |
| WorkflowChildCompleted | Child workflow execution completes (blocking mode) |
| WorkflowFailed | Reaches a terminal error state or exceeds Temporal timeout |

These events are consumable via the gRPC streaming API for real-time monitoring dashboards.
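A dashboard consuming this stream would typically fold the events into a current status. The sketch below shows one way to do that over an already-received list of events; the event names come from the table above, while the WorkflowEvent shape, the status labels, and statusFromEvents are hypothetical and do not describe the actual gRPC message format.

```typescript
type WorkflowEventName =
  | "WorkflowStarted"
  | "WorkflowStateEntered"
  | "WorkflowStateCompleted"
  | "WorkflowStateFailed"
  | "WorkflowSignalReceived"
  | "WorkflowCompleted"
  | "WorkflowChildStarted"
  | "WorkflowChildCompleted"
  | "WorkflowFailed";

// Hypothetical minimal event shape for illustration.
interface WorkflowEvent {
  name: WorkflowEventName;
  state?: string;
}

// Hypothetical status labels for a monitoring view.
type ExecutionStatus = "pending" | "running" | "completed" | "failed";

function statusFromEvents(events: WorkflowEvent[]): ExecutionStatus {
  let status: ExecutionStatus = "pending";
  for (const e of events) {
    if (e.name === "WorkflowCompleted") {
      status = "completed";
    } else if (e.name === "WorkflowFailed") {
      status = "failed";
    } else if (status === "pending") {
      // Any other event means the execution is underway.
      status = "running";
    }
  }
  return status;
}
```

A WorkflowStateFailed event deliberately does not mark the execution as failed here: a failed state can still transition to a recovery state, and only WorkflowFailed signals that the execution as a whole has failed.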
