
Multi-Node Deployment

Distribute AEGIS across multiple machines for production deployments using the gRPC cluster protocol.


AEGIS supports distributed deployments across multiple machines. Each machine runs one aegis daemon process configured with two orthogonal roles:

  • spec.node.type — the node's deployment role (whether it hosts agents vs. serves as an API entry point)
  • spec.cluster.role — the node's cluster coordination role (whether it controls the cluster vs. accepts forwarded work)

These two settings are independent and can be mixed freely. A node can be spec.node.type: orchestrator and spec.cluster.role: worker at the same time.
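For example, a single machine could host the API while also accepting forwarded work (a hypothetical fragment; field names follow the full examples later on this page):

```yaml
spec:
  node:
    type: orchestrator   # hosts the API / management plane
  cluster:
    role: worker         # accepts forwarded executions from another controller
```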


Node Types (spec.node.type)

| Type | Role |
|------|------|
| orchestrator | Hosts the management plane: API server, workflow engine, Temporal client, Cortex connection, secrets manager. Does not run agent containers locally. |
| edge | Executes agent containers (Docker runtime). Does not expose the public API. Connects to a controller node for task assignment. |
| hybrid | Combines both roles on a single machine. The default for development and small deployments. |

Cluster Protocol (spec.cluster.role)

AEGIS uses a dedicated NodeClusterService gRPC protocol on port 50056 for inter-node coordination. This is separate from the agent gRPC API on port 50051.

Cluster Roles

| Role | Description |
|------|-------------|
| controller | Manages the NodeCluster aggregate. Routes executions to workers. Issues NodeSecurityTokens to attested workers. Exposes NodeClusterService on port 50056. |
| worker | Attests to a controller on startup. Advertises NodeCapabilityAdvertisement. Accepts forwarded executions via the ForwardExecution RPC. |
| hybrid | Controller and worker in one process. Used for standalone single-node and development deployments. No separate controller endpoint needed. |

The Two-Tier Model

  spec.node.type     →  determines what the node DOES for agents
                        (run the API, run containers, or both)

  spec.cluster.role  →  determines how nodes COORDINATE with each other
                        (who routes work, who accepts work)

A common production pairing: the API-facing machine is type: orchestrator, cluster.role: controller; each GPU machine is type: edge, cluster.role: worker. The controller receives execution requests from clients and routes them to workers best suited to run them.

NodeSecurityToken

After a worker successfully attests to a controller, the controller issues a NodeSecurityToken: an RS256 JWT signed by the controller's OpenBao Transit key. This token:

  • Has a 1-hour TTL and is auto-refreshed before expiry
  • Is analogous to the agent SecurityToken from the SEAL protocol, but scoped to node identity
  • Contains node_id, role, capabilities_hash, iat, and exp
  • Is included in every subsequent cluster RPC wrapped in a SealNodeEnvelope

All inter-node gRPC calls use this envelope:

SealNodeEnvelope {
  node_security_token: <NodeSecurityToken JWT>,
  signature: <Ed25519 signature over payload>,
  payload: <serialized RPC request bytes>
}
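As a rough illustration of how a request is wrapped, here is a std-only Rust sketch. The `toy_sign` helper is a stand-in for the real Ed25519 signing AEGIS performs with the node's persistent keypair; all names besides the envelope fields are hypothetical.

```rust
// Envelope fields mirror the SealNodeEnvelope shape above.
struct SealNodeEnvelope {
    node_security_token: String, // RS256 JWT issued at attestation
    signature: Vec<u8>,          // Ed25519 signature over `payload` (toy here)
    payload: Vec<u8>,            // serialized RPC request bytes
}

// Placeholder "signature" so the sketch runs without crypto crates:
// a real node signs with its Ed25519 private key instead.
fn toy_sign(key: &[u8], payload: &[u8]) -> Vec<u8> {
    key.iter().chain(payload).fold(vec![0u8; 8], |mut acc, b| {
        let i = (*b as usize) % 8;
        acc[i] = acc[i].wrapping_add(*b);
        acc
    })
}

fn wrap_rpc(token: &str, key: &[u8], payload: Vec<u8>) -> SealNodeEnvelope {
    SealNodeEnvelope {
        node_security_token: token.to_string(),
        signature: toy_sign(key, &payload),
        payload,
    }
}

fn main() {
    let env = wrap_rpc("eyJ...", b"node-key", b"RegisterNode-bytes".to_vec());
    assert_eq!(env.signature.len(), 8);
    println!("wrapped {} payload bytes", env.payload.len());
}
```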

Typical Topologies

Development / Single Node

┌──────────────────────────────┐
│   Hybrid Node                │  spec.node.type: hybrid
│   spec.cluster.role: hybrid  │
│                              │
│  API Server  (gRPC + REST)   │
│  Scheduler                   │
│  Docker (agent containers)   │
│  NodeClusterService :50056   │  ← routes to itself
└──────────────────────────────┘

Use type: hybrid and cluster.role: hybrid for local development and small deployments. This is the default in aegis-config.yaml.


Production — Separated Control / Data Plane

┌────────────────────────────────────────┐
│   Orchestrator Node                    │  spec.node.type: orchestrator
│   (1–3 instances)                      │
│                                        │
│  API: gRPC :50051, REST :8080          │
│  Workflow engine                       │
│  Temporal client                       │
│  Secrets (OpenBao)                     │
└──────────────┬─────────────────────────┘
               │  (internal network)
        ┌──────┴──────┐
        │             │
┌───────▼───┐   ┌─────▼─────┐
│  Edge #1  │   │  Edge #2  │   spec.node.type: edge
│  Docker   │   │  Docker   │
│  agents   │   │  agents   │
└───────────┘   └───────────┘

Edge nodes handle the compute-intensive agent workloads. Adding more edge nodes scales execution throughput without affecting the orchestrator.


Production — Cluster Protocol (Controller + Workers)

This topology adds the NodeClusterService layer for authenticated, capability-aware execution routing.

┌──────────────────────────────────────────────────────────────────────┐
│                         Controller Node                              │
│          spec.node.type: orchestrator                                │
│          spec.cluster.role: controller                               │
│                                                                      │
│   gRPC :50051  ←  agent API (external clients)                      │
│   gRPC :50056  ←  NodeClusterService (workers only)                 │
└──────────────────────────────┬───────────────────────────────────────┘
                               │ SealNodeEnvelope over gRPC :50056
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
        ┌──────────┐    ┌──────────┐    ┌──────────┐
        │ Worker 1 │    │ Worker 2 │    │ Worker 3 │
        │ :50051   │    │ :50051   │    │ :50051   │
        └──────────┘    └──────────┘    └──────────┘
          spec.node.type: edge
          spec.cluster.role: worker

When a client submits an execution to the controller, the NodeRouter selects the best available worker (Phase 1: round-robin among healthy workers with matching tags; Phase 2: load-aware scoring). The controller forwards the execution to the selected worker via the ForwardExecution server-streaming RPC on port 50051, then streams ExecutionEvents back to the original client.

The ClusterAwareExecutionService handles this routing and forwarding transparently — it wraps the local ExecutionService and routes to workers when cluster mode is enabled, falling back to local execution when no workers are available.
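The Phase-1 selection logic can be sketched as follows (illustrative std-only Rust; the type and field names are assumptions, not the actual NodeRouter API):

```rust
#[derive(Clone)]
struct Peer {
    node_id: String,
    healthy: bool,
    tags: Vec<String>,
}

struct RoundRobinRouter {
    next: usize, // rotates across successive route() calls
}

impl RoundRobinRouter {
    // Round-robin over healthy peers whose tags include every requested tag.
    fn route(&mut self, peers: &[Peer], target_tags: &[&str]) -> Option<String> {
        let eligible: Vec<&Peer> = peers
            .iter()
            .filter(|p| p.healthy)
            .filter(|p| target_tags.iter().all(|t| p.tags.iter().any(|pt| pt == t)))
            .collect();
        if eligible.is_empty() {
            return None; // caller falls back to local execution
        }
        let pick = eligible[self.next % eligible.len()].node_id.clone();
        self.next = self.next.wrapping_add(1);
        Some(pick)
    }
}

fn main() {
    let peers = vec![
        Peer { node_id: "w1".into(), healthy: true, tags: vec!["gpu".into()] },
        Peer { node_id: "w2".into(), healthy: false, tags: vec!["gpu".into()] },
        Peer { node_id: "w3".into(), healthy: true, tags: vec!["gpu".into()] },
    ];
    let mut r = RoundRobinRouter { next: 0 };
    assert_eq!(r.route(&peers, &["gpu"]), Some("w1".into())); // w2 skipped: unhealthy
    assert_eq!(r.route(&peers, &["gpu"]), Some("w3".into())); // round-robin advances
    assert_eq!(r.route(&peers, &["tpu"]), None);              // no tag match
}
```

Phase 2's load-aware scoring would replace the simple rotation with a score over each peer's advertised load, but the tag and health filters stay the same.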


Configuring Nodes

Orchestrator / Controller Node

apiVersion: 100monkeys.ai/v1
kind: NodeConfig
metadata:
  name: "orchestrator-primary"
spec:
  node:
    id: "orch-node-1"
    type: "orchestrator"
    region: "us-west-2"
    tags: ["primary"]

  # Orchestrator nodes must specify all external dependencies
  llm_providers: [...]
  storage: { backend: "seaweedfs", ... }
  # ...

Edge Node

apiVersion: 100monkeys.ai/v1
kind: NodeConfig
metadata:
  name: "edge-worker-1"
spec:
  node:
    id: "edge-node-1"
    type: "edge"
    region: "us-west-2"
    tags: ["gpu", "large-memory"]
    resources:
      cpu_cores: 32
      memory_gb: 128
      disk_gb: 500
      gpu: true

  runtime:
    # Point the edge node at the orchestrator for callbacks
    orchestrator_url: "https://orchestrator.internal:8080"
    docker_network_mode: "aegis-net"
    nfs_server_host: "127.0.0.1"

  # Edge nodes do not need llm_providers or storage config —
  # they delegate those duties to the orchestrator

spec.node.resources

Declare available hardware so the scheduler can make placement decisions:

| Field | Type | Description |
|-------|------|-------------|
| cpu_cores | integer | CPU cores available to agent containers |
| memory_gb | integer | RAM in GB available to agent containers |
| disk_gb | integer | Disk space in GB |
| gpu | boolean | Whether a GPU is available |

spec.node.tags

Tags are used for execution target matching. An agent manifest can specify spec.execution.target_tags to pin executions to nodes with matching tags:

# In agent manifest
spec:
  execution:
    target_tags: ["gpu"]    # Only schedule on nodes tagged "gpu"

Cluster Configuration

The spec.cluster block configures the gRPC cluster protocol. All nodes require a persistent Ed25519 keypair; if keypair_path does not exist on disk, AEGIS auto-generates one on first run.

Controller Node

spec:
  node:
    id: "ctrl-001"
    type: orchestrator
    region: us-east-1
    tags: [controller, production]
  network:
    port: 8080
    grpc_port: 50051
  cluster:
    role: controller
    node_id: "ctrl-001"           # must match spec.node.id
    keypair_path: /etc/aegis/node-keypair.pem
    cluster_grpc_port: 50056

The controller exposes NodeClusterService on cluster_grpc_port (default 50056). Only attested workers may call RPCs on this port.

Worker Node

spec:
  node:
    id: "worker-gpu-001"
    type: edge
    region: us-east-1
    tags: [gpu, production]
    resources:
      cpu_cores: 16
      memory_gb: 64
      gpu: true
  cluster:
    role: worker
    node_id: "worker-gpu-001"     # must match spec.node.id
    controller_endpoint: "https://ctrl-001.internal:50056"
    keypair_path: /etc/aegis/node-keypair.pem
    heartbeat_interval_seconds: 30

Workers contact the controller at controller_endpoint on startup to perform attestation. The heartbeat_interval_seconds (default 30) controls how frequently the worker sends a Heartbeat RPC carrying its current load and capabilities.

Hybrid Node (Standalone / Development)

spec:
  node:
    id: "dev-local"
    type: hybrid
  cluster:
    role: hybrid   # controller + worker in one process
    node_id: "dev-local"
    keypair_path: /etc/aegis/node-keypair.pem

No controller_endpoint is needed in hybrid mode — the process routes executions to itself.


Node Attestation Flow

Before a worker can receive forwarded executions, it must prove its identity to the controller. This happens automatically on startup via the following sequence:

Worker                                      Controller
  │                                              │
  │  1. Load Ed25519 keypair from               │
  │     spec.cluster.keypair_path               │
  │     (auto-generated if absent)              │
  │                                              │
  │──── AttestNode(node_id, public_key) ────────►│
  │                                              │  2. Generates cryptographic challenge
  │◄─── ChallengeNode(challenge_bytes) ──────────│
  │                                              │
  │  3. Signs challenge_bytes with              │
  │     Ed25519 private key                     │
  │                                              │
  │──── ChallengeNode(signature) ───────────────►│
  │                                              │  4. Verifies signature against
  │                                              │     registered public key
  │◄─── NodeSecurityToken (RS256 JWT) ───────────│  5. Issues token (1-hour TTL)
  │                                              │     via OpenBao Transit signing
  │  6. Wraps all future RPCs in               │
  │     SealNodeEnvelope {                      │
  │       node_security_token,                  │
  │       signature,                            │
  │       payload                                │
  │     }                                       │
  │                                              │
  │──── RegisterNode(capabilities) ────────────►│  7. Advertises NodeCapabilityAdvertisement
  │                                              │     { gpu_count, vram_gb, cpu_cores,
  │                                              │       available_memory_gb,
  │                                              │       supported_runtimes, tags }
  │◄─── Registered ─────────────────────────────│
  │                                              │
  │──── Heartbeat (every 30s) ─────────────────►│  8. Keeps NodePeer status Active
  │◄─── NodeCommand (optional) ─────────────────│     Response may carry commands:
  │                                              │     drain, config push, shutdown

The NodeSecurityToken is automatically refreshed before its 1-hour expiry via the ongoing heartbeat cycle. No manual token management is required.

NodePeer status transitions:

  • Active — node is registered and sending heartbeats within the expected interval
  • Draining — controller has issued a drain command; no new executions are routed to this node
  • Unhealthy — no heartbeat received within 3× the expected interval
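The freshness rule above can be sketched in a few lines (illustrative Rust; names are hypothetical, and Draining is only entered via an explicit drain command, never by the sweep):

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PeerStatus { Active, Draining, Unhealthy }

// A peer becomes Unhealthy once no heartbeat has arrived within 3x the
// expected interval; otherwise its current status is preserved.
fn sweep(status: PeerStatus, since_last_heartbeat: Duration, interval: Duration) -> PeerStatus {
    if since_last_heartbeat > interval * 3 {
        PeerStatus::Unhealthy
    } else {
        status
    }
}

fn main() {
    let interval = Duration::from_secs(30);
    assert_eq!(sweep(PeerStatus::Active, Duration::from_secs(45), interval), PeerStatus::Active);
    assert_eq!(sweep(PeerStatus::Active, Duration::from_secs(120), interval), PeerStatus::Unhealthy);
    assert_eq!(sweep(PeerStatus::Draining, Duration::from_secs(60), interval), PeerStatus::Draining);
}
```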

NodeClusterService RPC Reference

The NodeClusterService exposes 10 RPCs on port 50056. All RPCs except AttestNode and ChallengeNode require a valid NodeSecurityToken in the authorization metadata key and a SealNodeEnvelope wrapping the payload. AttestNode is unauthenticated (it is the first call a new worker makes).

| RPC | Direction | Description |
|-----|-----------|-------------|
| AttestNode | Worker → Controller | Initiate node attestation; send public key |
| ChallengeNode | Bidirectional | Controller issues challenge; worker returns Ed25519 signature |
| RegisterNode | Worker → Controller | Register with NodeCapabilityAdvertisement after attestation |
| Heartbeat | Worker → Controller | Periodic status update (default 30s); response may carry NodeCommands |
| DeregisterNode | Worker → Controller | Graceful deregistration before shutdown |
| RouteExecution | Controller-internal | Returns ExecutionRoute { target_node_id, worker_grpc_address } |
| ForwardExecution | Controller → Worker (server-streaming) | Execute an agent on this worker; streams ExecutionEvents back |

Execution forwarding end-to-end flow:

  1. Client sends an execution request to the controller.
  2. ClusterAwareExecutionService calls RouteExecutionUseCase to select a worker based on health, tags, and availability.
  3. Controller connects to the selected worker via NodeClusterClient::connect_to_worker().
  4. Controller calls forward_execution() with the original execution_id preserved for end-to-end tracing correlation.
  5. Worker runs the execution locally via start_execution_with_id(), importing the upstream execution ID rather than generating a new one.
  6. Execution events stream back to the controller via the gRPC server-streaming response, which relays them to the original client.

The remaining configuration and introspection RPCs:

| RPC | Direction | Description |
|-----|-----------|-------------|
| SyncConfig | Worker → Controller | Worker requests current config from controller |
| PushConfig | Controller → Worker | Controller pushes updated config to a specific worker |
| ListPeers | Any → Controller | List all registered NodePeers with their status and capabilities |


Node Registration

Node registration is performed via gRPC using the NodeClusterService protocol described above, not via HTTP. The original HTTP-based NodeIdentity registration is used in legacy single-node mode only and is not involved in cluster coordination.

The registration sequence after successful attestation:

  1. Worker calls RegisterNode carrying its NodeCapabilityAdvertisement
  2. Controller records the NodePeer in the NodeCluster aggregate
  3. Worker enters the heartbeat loop (Heartbeat every heartbeat_interval_seconds)
  4. Controller updates NodePeer.last_heartbeat_at and NodePeer.status on each heartbeat
  5. On graceful shutdown, worker calls DeregisterNode; controller marks NodePeer as removed

Networking Requirements

| Connection | Protocol | Port | Direction | Notes |
|------------|----------|------|-----------|-------|
| Client → Controller | HTTP REST | 8080 | Inbound to controller | |
| Client → Controller | gRPC (agent API) | 50051 | Inbound to controller | |
| Worker → Controller | gRPC (NodeClusterService) | 50056 | Outbound from worker | |
| Controller → Worker | gRPC (ForwardExecution) | 50051 | Outbound from controller | |
| Orchestrator → Temporal | | 7233 | Outbound | Workflow engine |
| Orchestrator → SeaweedFS | | 8888 | Outbound | Storage filer |
| Edge → SeaweedFS | | 8888 | Outbound | Volume data access |
| Edge agent containers → Edge daemon | NFS | 2049 | Internal | Volume mounts via NFS Gateway |

Firewall rules must allow:

  • Controller inbound on 8080, 50051, and 50056
  • Workers inbound on 50051 (for ForwardExecution streaming from controller)
  • Workers outbound to controller on 50056

Port 50056 should not be exposed to external clients — it is for inter-node cluster coordination only. Use network-level ACLs to restrict access to known worker IPs.


High Availability

Phase 1 (Current): Single Controller + Multiple Workers

Deploy one controller node and N worker nodes. The controller is the coordination point; workers are horizontally scalable. Multiple worker nodes provide execution capacity and fault tolerance for agent workloads.

                     ┌──────────────────────┐
                     │    Load Balancer      │  (agent API traffic only)
                     └──────────┬───────────┘
                                │ :8080 / :50051
                     ┌──────────▼───────────┐
                     │   Controller Node     │  spec.cluster.role: controller
                     │   (single instance)  │
                     └──────────┬───────────┘
                                │  port :50056
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
    │   Worker 1   │  │   Worker 2   │  │   Worker 3   │
    │   :50051     │  │   :50051     │  │   :50051     │
    └──────────────┘  └──────────────┘  └──────────────┘

For controller resilience in Phase 1, run the controller with PostgreSQL as the persistence backend; on controller restart, workers re-attest automatically via the AttestNode → ChallengeNode → RegisterNode flow. A controller outage only affects routing; in-flight executions on workers complete normally.

Phase 2 (Planned): Controller HA with Raft

Phase 2 will introduce a Raft-based controller consensus layer, allowing N controller replicas with automatic leader election. Workers will use a discovery endpoint returned during attestation to locate the current leader. This is not yet implemented.

For all orchestrator instances in either phase, share the same PostgreSQL database for consistent execution state. Redis is optional for session-level caching.

                      ┌─────────────────┐
                      │  Load Balancer  │
                      └────────┬────────┘
               ┌───────────────┼────────────────┐
        ┌──────▼──────┐  ┌─────▼──────┐  ┌─────▼──────┐
        │Controller 1 │  │Controller 2│  │Controller 3│  (Phase 2)
        │  (leader)   │  │ (follower) │  │ (follower) │
        └──────┬──────┘  └────────────┘  └────────────┘
               │  Raft consensus
        ┌──────▼──────────────┐
        │    PostgreSQL        │
        │    (shared state)   │
        └─────────────────────┘

Day-2 Operations: mTLS Certificate Rotation

In production environments, AEGIS requires mutual TLS (mTLS) for all inter-node communication on port 50056. This ensures that only nodes possessing a certificate signed by the platform CA can even attempt the SEAL attestation flow.

Rotating the Platform CA

If the root platform CA is rotated, you must perform a multi-step rollout to avoid cluster-wide disconnects:

  1. Distribute New CA: Update spec.cluster.tls.ca_cert on all nodes to include BOTH the old and new CA certificates in a single PEM bundle. Restart all nodes.
  2. Issue New Node Certs: Generate new node certificates signed by the new CA.
  3. Update Node Certs: Update spec.cluster.tls.cert_path and key_path on each node one-by-one and restart.
  4. Remove Old CA: Once all nodes are using certificates from the new CA, remove the old CA from the ca_cert bundle.

Zero-Downtime Node Certificate Rotation

Individual node certificates (e.g., node.crt) can be rotated without cluster downtime as long as the CA remains valid. The aegis daemon watches the certificate files on disk and reloads them automatically upon change (when using standard spec.cluster.tls configuration).

# Example: Rotating a worker certificate
cp new-node.crt /etc/aegis/certs/node.crt
cp new-node.key /etc/aegis/certs/node.key
# AEGIS will detect the change and use the new cert for the next gRPC connection

Troubleshooting Cluster Operations

NodeSecurityToken Attestation Failures

If a worker fails to join the cluster, check the controller logs for the following common errors:

| Error | Root Cause | Resolution |
|-------|------------|------------|
| TokenExpired | Clock skew: the worker's system clock is significantly ahead of the controller's. | Synchronize clocks on all nodes using NTP (e.g., chronyd). |
| NonceReplay | Replay attack / rapid restart: the worker sent an attestation request with a nonce that was already used in the last 5 minutes. | Wait 5 minutes before restarting the worker, or ensure the worker generates a fresh UUID for the nonce field on every attempt. |
| InvalidSignature | Keypair mismatch: the worker's Ed25519 signature does not match its registered public key. | Verify that spec.cluster.keypair_path points to the same persistent key that was used during the initial AttestNode call. |
| UntrustedCA | mTLS failure: the certificate presented by the node is not signed by the CA in spec.cluster.tls.ca_cert. | Verify that all nodes share the same platform CA bundle. |

Diagnostic Commands

Use the aegis CLI on the controller node to inspect cluster health:

# List all registered peers
aegis node peers

# Check end-to-end cluster health from the local node config
aegis status --cluster

Health Endpoints

Each AEGIS daemon exposes HTTP health endpoints on the configured REST port (default 8080). These endpoints are useful for load balancer health checks, Kubernetes probes, and manual diagnostics.

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health/live | GET | Liveness probe. Returns 200 OK if the process is running. Does not check downstream dependencies. |
| /health/ready | GET | Readiness probe. Returns 200 OK only when all critical subsystems (database, Temporal, event bus) are initialised. Returns 503 Service Unavailable during startup or if a dependency becomes unreachable. |
| /health | GET | Composite health check. Returns 200 OK with a JSON body containing uptime_seconds and subsystem status. Used by the aegis status CLI command and the daemon client library. |

In a clustered deployment, the controller node's HealthSweeper background task also monitors worker health by evaluating heartbeat freshness. If a worker misses 3 consecutive heartbeat intervals, the HealthSweeper marks its NodePeer status as Unhealthy and emits a ClusterEvent::NodeUnhealthy event.


Remote Volume Access

When an agent executing on Worker Node A needs to access a volume that resides on Worker Node B, AEGIS transparently proxies the file operation via the RemoteStorageService gRPC protocol.

How It Works

  1. The StorageRouter on Node A inspects the file path. Paths prefixed with /aegis/seal/{node_id}/{volume_id}/... are routed to the SealStorageProvider.
  2. SealStorageProvider extracts the target node_id from the path, looks up the node's gRPC address in the NodeClusterRepository, and establishes (or reuses) a gRPC channel.
  3. Each RPC call is wrapped in a SealNodeEnvelope carrying the node's Ed25519 signature and NodeSecurityToken.
  4. The target node's RemoteStorageServiceHandler verifies the envelope, performs authoritative AegisFSAL access checks, and delegates to its local StorageProvider.
  5. The response is returned to the requesting node and surfaced through the standard StorageProvider trait.

Supported Operations

All POSIX-style file operations are supported over the wire:

  • CreateDirectory, DeleteDirectory
  • SetQuota, GetUsage
  • OpenFile, ReadAt, WriteAt, CloseFile
  • Stat, Readdir
  • CreateFile, DeleteFile, Rename
  • HealthCheck

Path Format

Remote volume paths follow this convention:

/aegis/seal/{target_node_id}/{volume_id}/{path/within/volume}

The StorageRouter detects the /aegis/seal/ prefix and dispatches to SealStorageProvider. All other /aegis/volumes/... paths route to the default backend (SeaweedFS/OpenDAL), and bare absolute paths (/opt/data/...) route to the local host filesystem.
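These dispatch rules can be sketched as a small path matcher (illustrative Rust; the real StorageRouter operates on richer types than string slices):

```rust
#[derive(Debug, PartialEq)]
enum Route<'a> {
    // /aegis/seal/{node_id}/{volume_id}/... -> SealStorageProvider
    Seal { node_id: &'a str, volume_id: &'a str, rest: &'a str },
    // /aegis/volumes/... -> default backend (SeaweedFS/OpenDAL)
    DefaultBackend,
    // bare absolute paths -> local host filesystem
    LocalHost,
}

fn route(path: &str) -> Route<'_> {
    if let Some(rest) = path.strip_prefix("/aegis/seal/") {
        let mut parts = rest.splitn(3, '/');
        if let (Some(node), Some(vol), Some(inner)) = (parts.next(), parts.next(), parts.next()) {
            return Route::Seal { node_id: node, volume_id: vol, rest: inner };
        }
    }
    if path.starts_with("/aegis/volumes/") {
        return Route::DefaultBackend;
    }
    Route::LocalHost
}

fn main() {
    assert_eq!(
        route("/aegis/seal/worker-2/vol-9/data/out.txt"),
        Route::Seal { node_id: "worker-2", volume_id: "vol-9", rest: "data/out.txt" }
    );
    assert_eq!(route("/aegis/volumes/vol-9/x"), Route::DefaultBackend);
    assert_eq!(route("/opt/data/file"), Route::LocalHost);
}
```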


Configuration Hierarchy

AEGIS supports hierarchical configuration layering with the following precedence (lowest to highest):

| Scope | Description | Example |
|-------|-------------|---------|
| Global | Cluster-wide defaults applied to all nodes | Default LLM provider settings |
| Tenant | Per-tenant overrides (keyed by TenantSlug) | Tenant-specific rate limits |
| Node | Per-node overrides (keyed by NodeId) | Node-specific runtime settings |
| Local | On-disk aegis-config.yaml loaded at startup | Hardware-specific configuration |

When the controller pushes a configuration update via the PushConfig heartbeat command, the worker merges the layers in precedence order. Each layer is stored as a ConfigSnapshot value object containing the scope, a JSON payload, and a version hash. The merged result is a MergedConfig that the worker applies atomically.

Workers can also pull their current effective configuration on demand using the SyncConfig RPC. This is useful after a restart when the worker needs to catch up on configuration changes that occurred while it was offline.

After merging configuration layers, the worker runs EffectiveConfigValidator to verify that the merged result contains all required sections (runtime, storage, llm) before accepting work. If validation fails, the worker logs a fatal error and refuses to enter the heartbeat loop — preventing a misconfigured node from silently accepting executions it cannot fulfil.
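A sketch of the precedence merge plus the required-sections check (illustrative Rust; real layers are JSON ConfigSnapshots, modeled here as flat key/value maps, and the validator name is from the text above):

```rust
use std::collections::HashMap;

type Layer = HashMap<String, String>;

// Later layers override earlier ones: Global < Tenant < Node < Local.
fn merge(layers: &[Layer]) -> Layer {
    let mut out = Layer::new();
    for layer in layers {
        for (k, v) in layer {
            out.insert(k.clone(), v.clone());
        }
    }
    out
}

// Mirror of the EffectiveConfigValidator rule: the merged result must
// contain the runtime, storage, and llm sections before work is accepted.
fn validate(merged: &Layer) -> Result<(), String> {
    for section in ["runtime", "storage", "llm"] {
        let prefix = format!("{section}.");
        if !merged.keys().any(|k| k == section || k.starts_with(&prefix)) {
            return Err(format!("missing required section: {section}"));
        }
    }
    Ok(())
}

fn main() {
    let global = Layer::from([("llm.provider".to_string(), "default".to_string()),
                              ("storage.backend".to_string(), "seaweedfs".to_string())]);
    let local = Layer::from([("llm.provider".to_string(), "local-override".to_string()),
                             ("runtime.docker_network_mode".to_string(), "aegis-net".to_string())]);
    let merged = merge(&[global, local]); // Local wins over Global
    assert_eq!(merged["llm.provider"], "local-override");
    assert!(validate(&merged).is_ok());
    assert!(validate(&Layer::new()).is_err());
}
```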


Worker Lifecycle

When a daemon starts with spec.cluster.role: worker or spec.cluster.role: hybrid and a controller.endpoint is configured, the daemon automatically spawns a WorkerLifecycle background task. This task manages the full lifecycle of the worker's relationship with the cluster controller.

Lifecycle Stages

┌─────────┐     ┌────────┐     ┌──────────┐     ┌───────────┐     ┌────────────┐
│ Connect │────▶│ Attest │────▶│ Register │────▶│ Heartbeat │────▶│ Deregister │
└─────────┘     └────────┘     └──────────┘     │   Loop    │     └────────────┘
                                                └───────────┘
  1. Connect — Establish a gRPC channel to the controller's NodeClusterService on port 50056.
  2. Attest — Perform the two-step Ed25519 challenge handshake (AttestNode + ChallengeNode). On success, the worker receives a NodeSecurityToken JWT.
  3. Register — Call RegisterNode to advertise the worker's NodeCapabilityAdvertisement (GPU count, CPU cores, memory, supported runtimes, tags). The controller creates initial NodeConfigAssignment and RuntimeRegistryAssignment records for the worker via the NodeRegistryRepository, separating cluster membership ("is this node alive?") from configuration assignment ("what should this node run?").
  4. Heartbeat Loop — Send Heartbeat RPCs at the configured heartbeat_interval_secs (default 30s). Process any NodeCommands returned in the response:
    • Drain — Stop accepting new executions; complete in-flight work.
    • PushConfig — Apply a configuration update from the controller.
    • Shutdown — Begin graceful process shutdown after draining.
  5. Deregister — On daemon shutdown (SIGTERM / Ctrl+C), call DeregisterNode to cleanly remove the worker from the cluster.
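The command handling in step 4 can be sketched as follows (illustrative Rust; the enum and state fields are assumptions, not the actual daemon types):

```rust
enum NodeCommand {
    Drain,
    PushConfig(String), // hypothetical: config payload version
    Shutdown,
}

struct WorkerState {
    accepting: bool,
    shutting_down: bool,
    config_version: Option<String>,
}

fn apply(state: &mut WorkerState, cmd: NodeCommand) {
    match cmd {
        // Drain: stop accepting new executions; in-flight work completes.
        NodeCommand::Drain => state.accepting = false,
        // PushConfig: apply the controller-supplied configuration update.
        NodeCommand::PushConfig(v) => state.config_version = Some(v),
        // Shutdown: drain, then begin graceful process shutdown.
        NodeCommand::Shutdown => {
            state.accepting = false;
            state.shutting_down = true;
        }
    }
}

fn main() {
    let mut s = WorkerState { accepting: true, shutting_down: false, config_version: None };
    apply(&mut s, NodeCommand::PushConfig("v2".into()));
    apply(&mut s, NodeCommand::Drain);
    assert!(!s.accepting && !s.shutting_down);
    assert_eq!(s.config_version.as_deref(), Some("v2"));
}
```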

If a heartbeat fails (e.g., network partition), the worker logs a warning and retries on the next interval. The controller's HealthSweeper will mark the worker as Unhealthy after 3 missed intervals.

Configuration

The worker lifecycle is controlled by the spec.cluster section of aegis-config.yaml:

spec:
  cluster:
    enabled: true
    role: worker
    controller:
      endpoint: "http://controller.internal:50056"
    node_keypair_path: /etc/aegis/node-keypair.pem
    heartbeat_interval_secs: 30       # How often to send heartbeats
    token_refresh_margin_secs: 120    # Re-attest this many seconds before token expiry
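The refresh decision implied by token_refresh_margin_secs can be sketched in one predicate (illustrative Rust; times are Unix seconds, and exp is the JWT expiry claim):

```rust
// Re-attest once fewer than `margin_secs` remain before the token's exp.
fn needs_refresh(now_unix: u64, token_exp_unix: u64, margin_secs: u64) -> bool {
    token_exp_unix <= now_unix + margin_secs
}

fn main() {
    // A 1-hour token issued at t=0 with a 120s margin refreshes from t=3480.
    assert!(!needs_refresh(3479, 3600, 120));
    assert!(needs_refresh(3480, 3600, 120));
}
```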
