Observability
Observability stack (Jaeger, Prometheus, Grafana, Loki), structured logging, OTLP export, alert rules, and metrics for AEGIS deployments.
AEGIS provides a comprehensive observability stack deployed as the pod-observability Podman pod, plus native structured logging and optional OTLP export from the orchestrator itself. This page covers both the deployed observability infrastructure and the orchestrator's telemetry output.
Observability Stack
The pod-observability pod in aegis-deploy bundles five components for production monitoring:
| Component | Version | Port | Purpose |
|---|---|---|---|
| Jaeger | 1.55 | 16686 (UI), 4317 (OTLP gRPC), 4318 (OTLP HTTP) | Distributed tracing |
| Prometheus | 2.51 | 9090 | Metrics collection and alerting |
| Grafana | 10.4 | 3300 | Dashboards and visualization |
| Loki | 3.0 | 3100 | Log aggregation |
| Promtail | 3.0 | 9080 | Container log scraping |
Prometheus Scrape Targets
Prometheus is pre-configured to scrape all AEGIS platform services:
| Target | Port | Endpoint |
|---|---|---|
| aegis-runtime | 9091 | /metrics |
| keycloak | 8180 | /metrics |
| seaweedfs-master | 9324 | /metrics |
| seaweedfs-volume | 9325 | /metrics |
| seaweedfs-filer | 9326 | /metrics |
| openbao | 8200 | /v1/sys/metrics |
| temporal | 7233 | /metrics |
| postgres-exporter | 9187 | /metrics |
Global scrape interval: 15 seconds. Evaluation interval: 15 seconds. Retention: 15 days.
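Expressed in Prometheus's own configuration, those defaults correspond to the following (note that retention is a server flag, not a `prometheus.yml` key):

```yaml
# prometheus.yml (shipped defaults)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Retention is set on the Prometheus command line:
#   --storage.tsdb.retention.time=15d
```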
Alert Rules
Pre-configured Prometheus alert rules shipped with aegis-deploy:
Critical (1–2 minute threshold):
| Alert | Condition | Severity |
|---|---|---|
| AEGISRuntimeDown | Runtime unreachable for 1 min | critical |
| PostgreSQLDown | Database unreachable for 1 min | critical |
| TemporalDown | Workflow engine unreachable for 2 min | critical |
| KeycloakDown | IAM provider unreachable for 1 min | critical |
Warning (5 minute threshold):
| Alert | Condition | Severity |
|---|---|---|
| HighExecutionFailureRate | >25% execution failures | warning |
| HighHTTPErrorRate | >5% HTTP 5xx responses | warning |
| HighAPILatency | P95 latency >5 seconds | warning |
| EventBusLagHigh | Lag rate >100 events/s | warning |
| SeaweedFSDown | Storage unreachable for 5 min | warning |
| OpenBaoDown | Secrets manager unreachable for 5 min | warning |
| HighPostgresConnections | >80 active connections | warning |
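As a sketch of how these thresholds are typically expressed, a Prometheus rule in the shape of HighHTTPErrorRate might look like the following. The metric name aegis_http_requests_total and its status label are assumptions for illustration — take the real names from the Metrics Reference:

```yaml
groups:
  - name: aegis-custom
    rules:
      - alert: HighHTTPErrorRate
        # Hypothetical metric name; verify against your /metrics output
        expr: |
          sum(rate(aegis_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(aegis_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of HTTP responses are 5xx"
```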
Grafana Dashboards
Three dashboards are auto-provisioned:
AEGIS Overview — Active executions, execution rate by status, iteration rate, agent lifecycle operations, event bus throughput, HTTP/gRPC request rates and latency percentiles, node info.
Infrastructure — Service up/down status for all components, PostgreSQL connections and cache hit ratio, Keycloak login events and sessions, SeaweedFS volume count and disk usage, Temporal workflow starts/completions/failures.
Logs Explorer — Log volume by container, full log stream, error log volume and filtered error view. Uses Loki as datasource.
Grafana is accessible at port 3300 with anonymous viewer access enabled by default. Datasources (Prometheus, Loki, Jaeger) are auto-configured.
Log Aggregation with Loki
Promtail scrapes all Podman container logs from /var/log/podman-containers/ and ships them to Loki. Logs are parsed using Docker log format, with container name labels extracted automatically.
- Retention: 7 days (168 hours)
- Schema: TSDB v13 with daily index periods
- Query: Use Grafana's Explore view or the Logs Explorer dashboard
# View logs for a specific container via Grafana
# Navigate to: http://localhost:3300 → Explore → Loki
# Query: {container_name="aegis-runtime"}
Log Levels
AEGIS uses Rust's RUST_LOG environment variable, which accepts a comma-separated list of level directives:
| Level | Usage |
|---|---|
| error | Unrecoverable failures (operation panics, infrastructure unavailability) |
| warn | Recoverable issues: missing optional config, NFS deregistration lag, LLM provider degraded |
| info | Normal lifecycle events: server started, execution completed, volume cleaned up |
| debug | Per-request details: tool routing decisions, SEAL validation, storage path resolution |
| trace | Verbose internal state: usually too noisy for production |
Recommended Settings
# Production
RUST_LOG=info
# Debug a specific subsystem
RUST_LOG=info,aegis_orchestrator_core::infrastructure::nfs=debug
# Debug all tool routing
RUST_LOG=info,aegis_orchestrator_core::infrastructure::tool_router=debug
# Debug tool-call judging
RUST_LOG=info,aegis_orchestrator_core::application::tool_invocation_service=debug,aegis_orchestrator_core::application::validation_service=debug
# Verbose SEAL audit
RUST_LOG=info,aegis_orchestrator_core::infrastructure::seal=debug
# Development (everything)
RUST_LOG=debug
Directive syntax: [crate::path=]level[,...]. Omitting the crate path sets a global minimum level.
Log Formats
AEGIS supports two output formats controlled at startup:
Pretty (default for development)
Human-readable colored text. Suitable for local development and docker logs:
2026-01-15T10:23:45.123Z INFO aegis_orchestrator: Starting gRPC server on 0.0.0.0:50051
2026-01-15T10:23:46.200Z INFO aegis_orchestrator: Connected to Cortex gRPC service url=http://cortex:50052
2026-01-15T10:23:47.001Z WARN aegis_orchestrator: Started with NO LLM providers configured. Agent execution will fail!
JSON (production)
Newline-delimited JSON; parseable by log aggregators:
{"timestamp":"2026-01-15T10:23:45.123Z","level":"INFO","target":"aegis_orchestrator","message":"Starting gRPC server on 0.0.0.0:50051"}
{"timestamp":"2026-01-15T10:23:46.200Z","level":"INFO","target":"aegis_orchestrator","fields":{"url":"http://cortex:50052"},"message":"Connected to Cortex gRPC service"}
Enable JSON format by setting the AEGIS_LOG_FORMAT environment variable:
AEGIS_LOG_FORMAT=json
If unset or set to any other value, the pretty format is used.
Structured Fields
Many log events include structured key-value fields alongside the message. These are available in both formats:
| Field | Events | Description |
|---|---|---|
| url | Service connection events | Target URL being connected to |
| execution_id | Execution lifecycle | UUID of the active execution |
| count | Volume cleanup | Number of volumes deleted |
| err | Error events | Error description |
| agent_id | Agent lifecycle | UUID of the agent |
When using JSON format, structured fields appear as keys in the JSON object under "fields".
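As a minimal sketch of working with these fields outside Grafana, JSON-formatted log lines can be filtered with nothing more than grep. The sample line mirrors the format shown above; the execution_id value is illustrative:

```shell
# Filter JSON logs for one execution and extract the message field.
log='{"timestamp":"2026-01-15T10:23:45.123Z","level":"INFO","target":"aegis_orchestrator","fields":{"execution_id":"a1b2"},"message":"Starting execution"}'
printf '%s\n' "$log" \
  | grep '"execution_id":"a1b2"' \
  | grep -o '"message":"[^"]*"'
# → "message":"Starting execution"
```

A structure-aware tool such as jq is preferable in production pipelines; plain grep is shown here only to keep the example dependency-free.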
Domain Events in Logs
AEGIS publishes structured domain events to its internal event bus. These events also produce log entries. Key observable events:
Execution Events
| Log Message Pattern | Level | Meaning |
|---|---|---|
| "Starting execution" | INFO | Execution started |
| "Inner loop generation failed" | ERROR | LLM generation failed for an iteration |
| "Could not find execution {} for LLM event" | WARN | Race condition during execution lookup |
Volume Events
| Log Message Pattern | Level | Meaning |
|---|---|---|
| "Volume cleanup: {} expired volumes deleted" | INFO | Periodic TTL cleanup completed |
| "Volume cleanup failed" | ERROR | Cleanup task failed |
| "NFS deregistration listener lagged" | WARN | Event bus buffer full; some deregistrations may have been missed |
Service Lifecycle
| Log Message Pattern | Level | Meaning |
|---|---|---|
| "Starting gRPC server on {}" | INFO | gRPC server started |
| "Starting AEGIS gRPC server on {}" | INFO | Internal gRPC server |
| "Connected to Cortex gRPC service" | INFO | Cortex connection established |
| "Cortex gRPC URL not configured" | INFO | Running in memoryless mode (expected when Cortex not deployed) |
| "Failed to connect to Temporal" | ERROR | Temporal workflow engine unreachable |
| "Failed to start some MCP servers" | ERROR | One or more MCP tool servers failed to start |
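These message patterns can be monitored directly in Loki. A sketch of a LogQL query surfacing the error and lag events above (the container_name label comes from the Promtail setup described earlier; adjust if your deployment labels differ):

```logql
{container_name="aegis-runtime"}
  |~ "Volume cleanup failed|NFS deregistration listener lagged|Failed to connect to Temporal"
```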
SEAL / Security Events
SEAL policy violations always produce WARN log entries with structured fields including execution_id, tool_name, and the violation type. These are produced by SealAudit:
{"level":"WARN","target":"aegis_orchestrator_core::infrastructure::seal::audit","fields":{"execution_id":"a1b2...","tool_name":"fs.delete","violation":"ToolExplicitlyDenied"},"message":"SEAL tool call blocked"}
Tool-Call Judging
AEGIS does not currently expose a dedicated telemetry stream for tool-call judging. Today, you can observe the following signals:
| Signal | What it tells you |
|---|---|
| Execution lifecycle logs | Whether the parent execution started, refined, completed, or failed. |
| Inner-loop generation logs | Whether the model returned a final response or hit a generation error. |
| SEAL audit logs | Whether a tool call was blocked by policy before routing. |
| Tool routing debug logs | Which routing path a tool took when debug logging is enabled. |
| Child execution logs | Whether a judge execution was spawned and how it completed, if you correlate by execution ID. |
Use these logs to infer tool-call judging behavior today. Do not assume a dedicated judge_execution_id, score, confidence, or decision field is emitted unless you verify it in the running build.
Recommended Future Work
If you want first-class operational visibility for tool-call judging, the next step is to add explicit judge telemetry in the orchestrator and surface it as structured logs or domain events. Useful fields would include the parent execution ID, the judge child execution ID, the tool name, the verdict score, the verdict confidence, and the final allow/block decision.
That work is recommended future instrumentation, not current runtime behavior.
Container Log Collection
The AEGIS daemon writes all logs to stdout and stderr. In the Podman pod deployment, Promtail automatically collects these logs and ships them to Loki. For standalone setups:
# Podman
podman logs -f aegis-runtime
# Docker
docker logs -f aegis-daemon
For additional log aggregation beyond the built-in Loki stack, configure your collector (Fluentd, Datadog Agent, etc.) to read container stdout and set AEGIS_LOG_FORMAT=json so log lines are parseable.
Health Checks
The REST API exposes health endpoints on port 8088:
curl http://localhost:8088/health
# → {"status":"ok"}
curl http://localhost:8088/health/live
# → liveness check
curl http://localhost:8088/health/ready
# → readiness check (all dependencies connected)
In the Podman pod deployment, health checks are configured automatically for every container. Use make validate to check all services at once.
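For scripted deployments, the readiness endpoint can gate subsequent steps. A sketch of a polling helper, assuming the default port 8088 (the function name and retry defaults are illustrative):

```shell
# Block until the AEGIS readiness endpoint responds, or give up after N tries.
wait_ready() {
  url="${1:-http://localhost:8088/health/ready}"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors, so a 503 counts as "not ready"
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example: gate a deployment step on readiness (uncomment to use)
# wait_ready "http://localhost:8088/health/ready" 30 || exit 1
```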
OTLP External Log Export
AEGIS can ship structured log records directly to any OpenTelemetry Protocol (OTLP)-compatible backend — Grafana Cloud, Datadog, Honeycomb, a self-hosted OpenTelemetry Collector, or any OTLP-native destination.
This feature is additive: stdout logging is always active. OTLP export is an optional second pipeline enabled by setting spec.observability.logging.otlp_endpoint in the node configuration (or the AEGIS_OTLP_ENDPOINT environment variable).
Quick Start
The minimum change to enable OTLP is adding otlp_endpoint to your node config:
spec:
  observability:
    logging:
      level: info
      format: json
      otlp_endpoint: "http://otel-collector:4317"  # gRPC (default protocol)
Or via environment variable (no config change needed):
export AEGIS_OTLP_ENDPOINT=http://otel-collector:4317
Protocol Selection
Two OTLP transports are supported, controlled by otlp_protocol (or AEGIS_OTLP_PROTOCOL):
| Protocol | Config value | Default port | Notes |
|---|---|---|---|
| gRPC | grpc (default) | 4317 | Preferred for self-hosted collectors |
| HTTP/Protobuf | http | 4318 | Required for some SaaS endpoints (Grafana Cloud, Datadog) |
logging:
  otlp_endpoint: "https://otlp-gateway.grafana.net/v1/logs"
  otlp_protocol: http
Authentication
Use otlp_headers to pass API keys or other authentication metadata. Values support the standard env: and secret: credential prefixes:
logging:
  otlp_endpoint: "https://otlp-gateway.grafana.net/v1/logs"
  otlp_protocol: http
  otlp_headers:
    Authorization: "env:GRAFANA_OTLP_TOKEN"  # resolved from env at startup
When setting headers via the AEGIS_OTLP_HEADERS environment variable, use a comma-separated key=value list:
AEGIS_OTLP_HEADERS="Authorization=Bearer my-token,x-scope-orgid=12345"
Never commit API keys or bearer tokens directly in your node config YAML. Always use env:VAR_NAME or secret:path credential prefixes, or set headers via the environment variable. See Credential Resolution.
Backend Integration Examples
Grafana Cloud Logs
logging:
  otlp_endpoint: "https://otlp-gateway-prod-us-central-0.grafana.net/v1/logs"
  otlp_protocol: http
  otlp_headers:
    Authorization: "env:GRAFANA_CLOUD_OTLP_TOKEN"  # Basic base64(instanceId:apiKey)
  otlp_service_name: "aegis-prod"
Datadog
logging:
  otlp_endpoint: "https://otlp.datadoghq.com/v1/logs"
  otlp_protocol: http
  otlp_headers:
    DD-API-KEY: "env:DATADOG_API_KEY"
  otlp_service_name: "aegis-prod"
Self-Hosted OpenTelemetry Collector
logging:
  otlp_endpoint: "http://otel-collector:4317"  # gRPC default
  otlp_service_name: "aegis-orchestrator"
Your otel-collector-config.yaml can then fan out to Loki, Jaeger, Prometheus, S3, or any other exporter.
Example: OTel Collector to Grafana Loki
For SREs running a self-hosted observability stack, the following OTel Collector configuration demonstrates how to receive OTLP logs from AEGIS and fan them out to Grafana Loki while preserving structured attributes:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  # Map OTLP attributes to Loki labels for efficient indexing
  resource:
    attributes:
      - key: service.name
        action: upsert
        value: "aegis-orchestrator"
exporters:
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
    labels:
      resource:
        service.name: "service_name"
        deployment.environment: "env"
      attributes:
        level: "level"
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
Minimum Export Level
Set otlp_min_level to filter how verbose the OTLP stream is, independently of stdout:
logging:
  level: debug          # verbose on stdout (useful during development)
  otlp_min_level: info  # only ship info+ to the backend (default)
Accepted values: error, warn, info, debug, trace. Override with AEGIS_OTLP_LOG_LEVEL.
Resource Attributes
Every exported log record automatically includes the following OpenTelemetry resource attributes:
| Attribute | Source | Example |
|---|---|---|
| service.name | otlp_service_name config key or AEGIS_OTLP_SERVICE_NAME | aegis-orchestrator |
| service.version | Compiled binary version | 0.1.0-pre-alpha |
| deployment.environment | metadata.labels.environment (when set) | production |
Batch Tuning
Log records are buffered and exported in batches. The defaults are suitable for most workloads; tune only when you observe dropped records or high-memory usage:
logging:
  otlp_endpoint: "http://otel-collector:4317"
  batch:
    max_queue_size: 4096        # increase if logs spike during high-iteration runs
    scheduled_delay_ms: 2000    # flush more frequently
    max_export_batch_size: 512  # records per HTTP/gRPC call
    export_timeout_ms: 10000    # per-call timeout
| Field | Default | Description |
|---|---|---|
| max_queue_size | 2048 | Maximum buffered records. Records are dropped if the queue is full. |
| scheduled_delay_ms | 5000 | Flush interval in milliseconds. |
| max_export_batch_size | 512 | Records per export RPC. |
| export_timeout_ms | 10000 | Per-call timeout in milliseconds. |
TLS Configuration
For self-signed certificates or private CA chains:
logging:
  otlp_endpoint: "https://otel-collector.internal:4317"
  tls:
    verify: true                              # keep true in production
    ca_cert_path: /etc/aegis/internal-ca.pem  # custom CA bundle
Set verify: false only for local development; it disables all certificate validation.
Environment Variable Reference
All OTLP settings can be supplied (or overridden) at runtime via environment variables, without modifying the node config file:
| Variable | Config equivalent | Notes |
|---|---|---|
| AEGIS_OTLP_ENDPOINT | logging.otlp_endpoint | Setting this variable enables OTLP export |
| AEGIS_OTLP_PROTOCOL | logging.otlp_protocol | grpc or http |
| AEGIS_OTLP_HEADERS | logging.otlp_headers | Comma-separated key=value pairs |
| AEGIS_OTLP_LOG_LEVEL | logging.otlp_min_level | Min level exported to OTLP |
| AEGIS_OTLP_SERVICE_NAME | logging.otlp_service_name | service.name resource attribute |
Metrics (Prometheus)
AEGIS exposes real-time operational metrics via a Prometheus-compatible endpoint. This allows you to monitor system health, execution performance, and security events using tools like Prometheus and Grafana.
The metrics endpoint is served over a dedicated HTTP listener, separate from the main API and gRPC ports.
Quick Start
By default, Prometheus metrics are enabled on port 9091. You can configure this in your node configuration:
spec:
  observability:
    metrics:
      enabled: true
      port: 9091
      path: "/metrics"
Or via environment variables:
export AEGIS_METRICS_ENABLED=true
export AEGIS_METRICS_PORT=9091
Scraping with Prometheus
Add the AEGIS node to your prometheus.yml scrape configuration:
scrape_configs:
  - job_name: 'aegis-orchestrator'
    static_configs:
      - targets: ['localhost:9091']
Key Observable Metrics
AEGIS provides a wide range of metrics across different subsystems:
- Executions: Active execution count, total completions/failures, and duration histograms.
- SEAL Security: Policy violations, attestation success/failure rates, and session counts.
- Storage (NFS): File operation counts, latencies, and total bytes read/written.
- Workflows: Active workflow executions and state transition counters.
- System: Node uptime and static version/identity information.
For a complete list of available metrics, labels, and descriptions, see the Metrics Reference.
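As a sketch, typical dashboard or alerting queries over these metric families might look like the following PromQL. The metric names here are illustrative assumptions — substitute the real names from the Metrics Reference:

```promql
# Hypothetical metric names; verify against /metrics before use
sum(aegis_executions_active)                  # active executions right now
sum(rate(aegis_executions_failed_total[5m]))  # execution failure rate
histogram_quantile(0.95,
  sum(rate(aegis_http_request_duration_seconds_bucket[5m])) by (le))  # P95 API latency
```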
Security Note
The metrics endpoint is unauthenticated by design, following standard Prometheus patterns. Ensure that the metrics port (9091 by default) is protected by your network firewall or Kubernetes NetworkPolicy and is not exposed to the public internet.
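In Kubernetes, that restriction can be sketched as a NetworkPolicy that admits only Prometheus to the metrics port. The pod labels below are placeholders — match them to your actual pod specs:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aegis-metrics-scrape-only
spec:
  podSelector:
    matchLabels:
      app: aegis-runtime        # placeholder label; match your pod spec
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: prometheus   # only Prometheus may reach the metrics port
      ports:
        - protocol: TCP
          port: 9091
```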
See Also
- Node Configuration Reference — full spec.observability schema with all OTLP fields
- Configuration Reference — RUST_LOG and AEGIS_LOG_FORMAT env vars
- Multi-Node Deployment — log aggregation across nodes
- Pod Architecture — complete pod topology and health check reference
- Podman Deployment — platform deployment with Podman pods
- REST API Reference — /health endpoint