Observability
Observability stack (Jaeger, Prometheus, Grafana, Loki), structured logging, OTLP export, alert rules, and metrics for AEGIS deployments.
AEGIS provides a comprehensive observability stack deployed as the pod-observability Podman pod, plus native structured logging and optional OTLP export from the orchestrator itself. This page covers both the deployed observability infrastructure and the orchestrator's telemetry output.
Observability Stack
The pod-observability pod in aegis-deploy bundles five components for production monitoring:
| Component | Version | Port | Purpose |
|---|---|---|---|
| Jaeger | 1.55 | 16686 (UI), 4317 (OTLP gRPC), 4318 (OTLP HTTP) | Distributed tracing |
| Prometheus | 2.51 | 9090 | Metrics collection and alerting |
| Grafana | 10.4 | 3300 | Dashboards and visualization |
| Loki | 3.0 | 3100 | Log aggregation |
| Promtail | 3.0 | 9080 | Container log scraping |
Prometheus Scrape Targets
Prometheus is pre-configured to scrape all AEGIS platform services:
| Target | Port | Endpoint |
|---|---|---|
| aegis-runtime | 9091 | /metrics |
| keycloak | 8180 | /metrics |
| seaweedfs-master | 9324 | /metrics |
| seaweedfs-volume | 9325 | /metrics |
| seaweedfs-filer | 9326 | /metrics |
| openbao | 8200 | /v1/sys/metrics |
| temporal | 7233 | /metrics |
| postgres-exporter | 9187 | /metrics |
Global scrape interval: 15 seconds. Evaluation interval: 15 seconds. Retention: 15 days.
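Expressed in Prometheus's own configuration, those defaults correspond to the following (note that retention is a server flag, not a `prometheus.yml` key):

```yaml
# prometheus.yml (shipped defaults)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Retention is set on the Prometheus command line:
#   --storage.tsdb.retention.time=15d
```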
Alert Rules
Pre-configured Prometheus alert rules shipped with aegis-deploy:
Critical (1–2 minute threshold):
| Alert | Condition | Severity |
|---|---|---|
| AEGISRuntimeDown | Runtime unreachable for 1 min | critical |
| PostgreSQLDown | Database unreachable for 1 min | critical |
| TemporalDown | Workflow engine unreachable for 2 min | critical |
| KeycloakDown | IAM provider unreachable for 1 min | critical |
Warning (5 minute threshold):
| Alert | Condition | Severity |
|---|---|---|
| HighExecutionFailureRate | >25% execution failures | warning |
| HighHTTPErrorRate | >5% HTTP 5xx responses | warning |
| HighAPILatency | P95 latency >5 seconds | warning |
| EventBusLagHigh | Lag rate >100 events/s | warning |
| SeaweedFSDown | Storage unreachable for 5 min | warning |
| OpenBaoDown | Secrets manager unreachable for 5 min | warning |
| HighPostgresConnections | >80 active connections | warning |
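As a sketch of how these thresholds are typically expressed, a Prometheus rule in the shape of HighHTTPErrorRate might look like the following. The metric name aegis_http_requests_total and its status label are assumptions for illustration — take the real names from the Metrics Reference:

```yaml
groups:
  - name: aegis-custom
    rules:
      - alert: HighHTTPErrorRate
        # Hypothetical metric name; verify against your /metrics output
        expr: |
          sum(rate(aegis_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(aegis_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of HTTP responses are 5xx"
```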
Grafana Dashboards
Three dashboards are auto-provisioned:
AEGIS Overview — Active executions, execution rate by status, iteration rate, agent lifecycle operations, event bus throughput, HTTP/gRPC request rates and latency percentiles, node info.
Infrastructure — Service up/down status for all components, PostgreSQL connections and cache hit ratio, Keycloak login events and sessions, SeaweedFS volume count and disk usage, Temporal workflow starts/completions/failures.
Logs Explorer — Log volume by container, full log stream, error log volume and filtered error view. Uses Loki as datasource.
Grafana is accessible at port 3300 with anonymous viewer access enabled by default. Datasources (Prometheus, Loki, Jaeger) are auto-configured.
Log Aggregation with Loki
Promtail scrapes all Podman container logs from /var/log/podman-containers/ and ships them to Loki. Logs are parsed using Docker log format, with container name labels extracted automatically.
- Retention: 7 days (168 hours)
- Schema: TSDB v13 with daily index periods
- Query: Use Grafana's Explore view or the Logs Explorer dashboard
# View logs for a specific container via Grafana
# Navigate to: http://localhost:3300 → Explore → Loki
# Query: {container_name="aegis-runtime"}
Log Levels
AEGIS uses Rust's RUST_LOG environment variable, which accepts a comma-separated list of level directives:
| Level | Usage |
|---|---|
| error | Unrecoverable failures (operation panics, infrastructure unavailability) |
| warn | Recoverable issues: missing optional config, NFS deregistration lag, LLM provider degraded |
| info | Normal lifecycle events: server started, execution completed, volume cleaned up |
| debug | Per-request details: tool routing decisions, SEAL validation, storage path resolution |
| trace | Verbose internal state: usually too noisy for production |
Recommended Settings
# Production
RUST_LOG=info
# Debug a specific subsystem
RUST_LOG=info,aegis_orchestrator_core::infrastructure::nfs=debug
# Debug all tool routing
RUST_LOG=info,aegis_orchestrator_core::infrastructure::tool_router=debug
# Debug tool-call judging
RUST_LOG=info,aegis_orchestrator_core::application::tool_invocation_service=debug,aegis_orchestrator_core::application::validation_service=debug
# Verbose SEAL audit
RUST_LOG=info,aegis_orchestrator_core::infrastructure::seal=debug
# Development (everything)
RUST_LOG=debug
Directive syntax: [crate::path=]level[,...]. Omitting the crate path sets a global minimum level.
Log Formats
AEGIS supports two output formats controlled at startup:
Pretty (default for development)
Human-readable colored text. Suitable for local development and docker logs:
2026-01-15T10:23:45.123Z INFO aegis_orchestrator: Starting gRPC server on 0.0.0.0:50051
2026-01-15T10:23:46.200Z INFO aegis_orchestrator: Connected to Cortex gRPC service url=http://cortex:50052
2026-01-15T10:23:47.001Z WARN aegis_orchestrator: Started with NO LLM providers configured. Agent execution will fail!
JSON (production)
Newline-delimited JSON; parseable by log aggregators:
{"timestamp":"2026-01-15T10:23:45.123Z","level":"INFO","target":"aegis_orchestrator","message":"Starting gRPC server on 0.0.0.0:50051"}
{"timestamp":"2026-01-15T10:23:46.200Z","level":"INFO","target":"aegis_orchestrator","fields":{"url":"http://cortex:50052"},"message":"Connected to Cortex gRPC service"}
Enable JSON format by setting the AEGIS_LOG_FORMAT environment variable:
AEGIS_LOG_FORMAT=json
If unset or set to any other value, the pretty format is used.
Structured Fields
Many log events include structured key-value fields alongside the message. These are available in both formats:
| Field | Events | Description |
|---|---|---|
| url | Service connection events | Target URL being connected to |
| execution_id | Execution lifecycle | UUID of the active execution |
| count | Volume cleanup | Number of volumes deleted |
| err | Error events | Error description |
| agent_id | Agent lifecycle | UUID of the agent |
When using JSON format, structured fields appear as keys in the JSON object under "fields".
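As a minimal sketch of working with these fields outside Grafana, JSON-formatted log lines can be filtered with nothing more than grep. The sample line mirrors the format shown above; the execution_id value is illustrative:

```shell
# Filter JSON logs for one execution and extract the message field.
log='{"timestamp":"2026-01-15T10:23:45.123Z","level":"INFO","target":"aegis_orchestrator","fields":{"execution_id":"a1b2"},"message":"Starting execution"}'
printf '%s\n' "$log" \
  | grep '"execution_id":"a1b2"' \
  | grep -o '"message":"[^"]*"'
# → "message":"Starting execution"
```

A structure-aware tool such as jq is preferable in production pipelines; plain grep is shown here only to keep the example dependency-free.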
Domain Events in Logs
AEGIS publishes structured domain events to its internal event bus. These events also produce log entries. Key observable events:
Execution Events
| Log Message Pattern | Level | Meaning |
|---|---|---|
| "Starting execution" | INFO | Execution started |
| "Inner loop generation failed" | ERROR | LLM generation failed for an iteration |
| "Could not find execution {} for LLM event" | WARN | Race condition during execution lookup |
Volume Events
| Log Message Pattern | Level | Meaning |
|---|---|---|
| "Volume cleanup: {} expired volumes deleted" | INFO | Periodic TTL cleanup completed |
| "Volume cleanup failed" | ERROR | Cleanup task failed |
| "NFS deregistration listener lagged" | WARN | Event bus buffer full; some deregistrations may have been missed |
Service Lifecycle
| Log Message Pattern | Level | Meaning |
|---|---|---|
| "Starting gRPC server on {}" | INFO | gRPC server started |
| "Starting AEGIS gRPC server on {}" | INFO | Internal gRPC server |
| "Connected to Cortex gRPC service" | INFO | Cortex connection established |
| "Cortex gRPC URL not configured" | INFO | Running in memoryless mode (expected when Cortex not deployed) |
| "Failed to connect to Temporal" | ERROR | Temporal workflow engine unreachable |
| "Failed to start some MCP servers" | ERROR | One or more MCP tool servers failed to start |
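These message patterns can be monitored directly in Loki. A sketch of a LogQL query surfacing the error and lag events above (the container_name label comes from the Promtail setup described earlier; adjust if your deployment labels differ):

```logql
{container_name="aegis-runtime"}
  |~ "Volume cleanup failed|NFS deregistration listener lagged|Failed to connect to Temporal"
```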
SEAL / Security Events
SEAL policy violations always produce WARN log entries with structured fields including execution_id, tool_name, and the violation type. These are produced by SealAudit:
{"level":"WARN","target":"aegis_orchestrator_core::infrastructure::seal::audit","fields":{"execution_id":"a1b2...","tool_name":"fs.delete","violation":"ToolExplicitlyDenied"},"message":"SEAL tool call blocked"}
Tool-Call Judging
AEGIS does not currently expose a dedicated telemetry stream for tool-call judging. Today, you can observe the following signals:
| Signal | What it tells you |
|---|---|
| Execution lifecycle logs | Whether the parent execution started, refined, completed, or failed. |
| Inner-loop generation logs | Whether the model returned a final response or hit a generation error. |
| SEAL audit logs | Whether a tool call was blocked by policy before routing. |
| Tool routing debug logs | Which routing path a tool took when debug logging is enabled. |
| Child execution logs | Whether a judge execution was spawned and how it completed, if you correlate by execution ID. |
Use these logs to infer tool-call judging behavior today. Do not assume a dedicated judge_execution_id, score, confidence, or decision field is emitted unless you verify it in the running build.
Recommended Future Work
If you want first-class operational visibility for tool-call judging, the next step is to add explicit judge telemetry in the orchestrator and surface it as structured logs or domain events. Useful fields would include the parent execution ID, the judge child execution ID, the tool name, the verdict score, the verdict confidence, and the final allow/block decision.
That work is recommended future instrumentation, not current runtime behavior.
Container Log Collection
The AEGIS daemon writes all logs to stdout and stderr. In the Podman pod deployment, Promtail automatically collects these logs and ships them to Loki. For standalone setups:
# Podman
podman logs -f aegis-runtime
# Docker
docker logs -f aegis-daemon
For additional log aggregation beyond the built-in Loki stack, configure your collector (Fluentd, Datadog Agent, etc.) to read container stdout and set AEGIS_LOG_FORMAT=json so log lines are parseable.
Health Checks
The REST API exposes health endpoints on port 8088:
curl http://localhost:8088/health
# → {"status":"ok"}
curl http://localhost:8088/health/live
# → liveness check
curl http://localhost:8088/health/ready
# → readiness check (all dependencies connected)
In the Podman pod deployment, health checks are configured automatically for every container. Use make validate to check all services at once.
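For scripted deployments, the readiness endpoint can gate subsequent steps. A sketch of a polling helper, assuming the default port 8088 (the function name and retry defaults are illustrative):

```shell
# Block until the AEGIS readiness endpoint responds, or give up after N tries.
wait_ready() {
  url="${1:-http://localhost:8088/health/ready}"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors, so a 503 counts as "not ready"
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example: gate a deployment step on readiness (uncomment to use)
# wait_ready "http://localhost:8088/health/ready" 30 || exit 1
```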
OTLP External Log Export
AEGIS can ship structured log records directly to any OpenTelemetry Protocol (OTLP)-compatible backend — Grafana Cloud, Datadog, Honeycomb, a self-hosted OpenTelemetry Collector, or any OTLP-native destination.
This feature is additive: stdout logging is always active. OTLP export is an optional second pipeline enabled by setting spec.observability.logging.otlp_endpoint in the node configuration (or the AEGIS_OTLP_ENDPOINT environment variable).
Quick Start
The minimum change to enable OTLP is adding otlp_endpoint to your node config:
spec:
  observability:
    logging:
      level: info
      format: json
      otlp_endpoint: "http://otel-collector:4317"  # gRPC (default protocol)
Or via environment variable (no config change needed):
export AEGIS_OTLP_ENDPOINT=http://otel-collector:4317
Protocol Selection
Two OTLP transports are supported, controlled by otlp_protocol (or AEGIS_OTLP_PROTOCOL):
| Protocol | Config value | Default port | Notes |
|---|---|---|---|
| gRPC | grpc (default) | 4317 | Preferred for self-hosted collectors |
| HTTP/Protobuf | http | 4318 | Required for some SaaS endpoints (Grafana Cloud, Datadog) |
logging:
  otlp_endpoint: "https://otlp-gateway.grafana.net/v1/logs"
  otlp_protocol: http
Authentication
Use otlp_headers to pass API keys or other authentication metadata. Values support the standard env: and secret: credential prefixes:
logging:
  otlp_endpoint: "https://otlp-gateway.grafana.net/v1/logs"
  otlp_protocol: http
  otlp_headers:
    Authorization: "env:GRAFANA_OTLP_TOKEN"  # resolved from env at startup
When setting headers via the AEGIS_OTLP_HEADERS environment variable, use a comma-separated key=value list:
AEGIS_OTLP_HEADERS="Authorization=Bearer my-token,x-scope-orgid=12345"
Never commit API keys or bearer tokens directly in your node config YAML. Always use env:VAR_NAME or secret:path credential prefixes, or set headers via the environment variable. See Credential Resolution.
Backend Integration Examples
Grafana Cloud Logs
logging:
  otlp_endpoint: "https://otlp-gateway-prod-us-central-0.grafana.net/v1/logs"
  otlp_protocol: http
  otlp_headers:
    Authorization: "env:GRAFANA_CLOUD_OTLP_TOKEN"  # Basic base64(instanceId:apiKey)
  otlp_service_name: "aegis-prod"
Datadog
logging:
  otlp_endpoint: "https://otlp.datadoghq.com/v1/logs"
  otlp_protocol: http
  otlp_headers:
    DD-API-KEY: "env:DATADOG_API_KEY"
  otlp_service_name: "aegis-prod"
Self-Hosted OpenTelemetry Collector
logging:
  otlp_endpoint: "http://otel-collector:4317"  # gRPC default
  otlp_service_name: "aegis-orchestrator"
Your otel-collector-config.yaml can then fan out to Loki, Jaeger, Prometheus, S3, or any other exporter.
Example: OTel Collector to Grafana Loki
For SREs running a self-hosted observability stack, the following OTel Collector configuration demonstrates how to receive OTLP logs from AEGIS and fan them out to Grafana Loki while preserving structured attributes:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  # Map OTLP attributes to Loki labels for efficient indexing
  resource:
    attributes:
      - key: service.name
        action: upsert
        value: "aegis-orchestrator"
exporters:
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
    labels:
      resource:
        service.name: "service_name"
        deployment.environment: "env"
      attributes:
        level: "level"
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
Minimum Export Level
Set otlp_min_level to filter how verbose the OTLP stream is, independently of stdout:
logging:
  level: debug          # verbose on stdout (useful during development)
  otlp_min_level: info  # only ship info+ to the backend (default)
Accepted values: error, warn, info, debug, trace. Override with AEGIS_OTLP_LOG_LEVEL.
Resource Attributes
Every exported log record automatically includes the following OpenTelemetry resource attributes:
| Attribute | Source | Example |
|---|---|---|
| service.name | otlp_service_name config key or AEGIS_OTLP_SERVICE_NAME | aegis-orchestrator |
| service.version | Compiled binary version | 0.1.0-pre-alpha |
| deployment.environment | metadata.labels.environment (when set) | production |
Batch Tuning
Log records are buffered and exported in batches. The defaults are suitable for most workloads; tune only when you observe dropped records or high-memory usage:
logging:
  otlp_endpoint: "http://otel-collector:4317"
  batch:
    max_queue_size: 4096        # increase if logs spike during high-iteration runs
    scheduled_delay_ms: 2000    # flush more frequently
    max_export_batch_size: 512  # records per HTTP/gRPC call
    export_timeout_ms: 10000    # per-call timeout
| Field | Default | Description |
|---|---|---|
| max_queue_size | 2048 | Maximum buffered records. Records are dropped if the queue is full. |
| scheduled_delay_ms | 5000 | Flush interval in milliseconds. |
| max_export_batch_size | 512 | Records per export RPC. |
| export_timeout_ms | 10000 | Per-call timeout in milliseconds. |
TLS Configuration
For self-signed certificates or private CA chains:
logging:
  otlp_endpoint: "https://otel-collector.internal:4317"
  tls:
    verify: true                              # keep true in production
    ca_cert_path: /etc/aegis/internal-ca.pem  # custom CA bundle
Set verify: false only for local development; it disables all certificate validation.
Environment Variable Reference
All OTLP settings can be supplied (or overridden) at runtime via environment variables, without modifying the node config file:
| Variable | Config equivalent | Notes |
|---|---|---|
| AEGIS_OTLP_ENDPOINT | logging.otlp_endpoint | Setting this variable enables OTLP export |
| AEGIS_OTLP_PROTOCOL | logging.otlp_protocol | grpc or http |
| AEGIS_OTLP_HEADERS | logging.otlp_headers | Comma-separated key=value pairs |
| AEGIS_OTLP_LOG_LEVEL | logging.otlp_min_level | Min level exported to OTLP |
| AEGIS_OTLP_SERVICE_NAME | logging.otlp_service_name | service.name resource attribute |
Metrics (Prometheus)
AEGIS exposes real-time operational metrics via a Prometheus-compatible endpoint. This allows you to monitor system health, execution performance, and security events using tools like Prometheus and Grafana.
The metrics endpoint is served over a dedicated HTTP listener, separate from the main API and gRPC ports.
Quick Start
By default, Prometheus metrics are enabled on port 9091. You can configure this in your node configuration:
spec:
  observability:
    metrics:
      enabled: true
      port: 9091
      path: "/metrics"
Or via environment variables:
export AEGIS_METRICS_ENABLED=true
export AEGIS_METRICS_PORT=9091
Scraping with Prometheus
Add the AEGIS node to your prometheus.yml scrape configuration:
scrape_configs:
  - job_name: 'aegis-orchestrator'
    static_configs:
      - targets: ['localhost:9091']
Key Observable Metrics
AEGIS provides a wide range of metrics across different subsystems:
- Executions: Active execution count, total completions/failures, and duration histograms.
- SEAL Security: Policy violations, attestation success/failure rates, and session counts.
- Storage (NFS): File operation counts, latencies, and total bytes read/written.
- Workflows: Active workflow executions and state transition counters.
- System: Node uptime and static version/identity information.
For a complete list of available metrics, labels, and descriptions, see the Metrics Reference.
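As a sketch, typical dashboard or alerting queries over these metric families might look like the following PromQL. The metric names here are illustrative assumptions — substitute the real names from the Metrics Reference:

```promql
# Hypothetical metric names; verify against /metrics before use
sum(aegis_executions_active)                  # active executions right now
sum(rate(aegis_executions_failed_total[5m]))  # execution failure rate
histogram_quantile(0.95,
  sum(rate(aegis_http_request_duration_seconds_bucket[5m])) by (le))  # P95 API latency
```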
Security Note
The metrics endpoint is unauthenticated by design, following standard Prometheus patterns. Ensure that the metrics port (9091 by default) is protected by your network firewall or Kubernetes NetworkPolicy and is not exposed to the public internet.
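In Kubernetes, that restriction can be sketched as a NetworkPolicy that admits only Prometheus to the metrics port. The pod labels below are placeholders — match them to your actual pod specs:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aegis-metrics-scrape-only
spec:
  podSelector:
    matchLabels:
      app: aegis-runtime        # placeholder label; match your pod spec
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: prometheus   # only Prometheus may reach the metrics port
      ports:
        - protocol: TCP
          port: 9091
```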
See Also
- Node Configuration Reference — full spec.observability schema with all OTLP fields
- Configuration Reference — RUST_LOG and AEGIS_LOG_FORMAT env vars
- Multi-Node Deployment — log aggregation across nodes
- Pod Architecture — complete pod topology and health check reference
- Podman Deployment — platform deployment with Podman pods
- REST API Reference — /health endpoint