Building SRE Agent Swarm: A Multi-Agent Self-Healing Infrastructure Platform
A deep, implementation-level breakdown of how I built SRE Agent Swarm: six cooperating agents, incident FSM orchestration, safety gates, runbook execution, human approvals, and chaos-driven MTTD/MTTR scoring.
GitHub Repository: View Source Code
Most incident response tooling is still either:
- Monitoring-only (great at detection, weak at response)
- Automation-only (great at scripted actions, risky without context)
- Human-heavy (safe, but slow during pager fatigue)
I wanted a middle path: an event-driven control plane that can detect, diagnose, and remediate incidents automatically, but with explicit safety boundaries and human approval gates for risky actions.
That became SRE Agent Swarm.
This is not a toy chatbot attached to logs. It is a system project with:
- A 10+ service failure playground (Python, Go, Node.js, Django)
- A 6-agent incident-response swarm
- Formal incident lifecycle FSM with retries, timeouts, and escalation
- Runbook-driven remediation pipeline with policy enforcement
- Dashboard + WebSocket human-in-the-loop control
- Chaos scenarios with MTTD/MTTR scoring
The rest of this post is a full architecture and implementation walkthrough.
1. System Goals and Constraints
Functional goals
- Detect abnormal behavior quickly (metrics, logs, health, synthetic probes)
- Produce machine-readable root-cause hypotheses
- Map diagnosis to safe and verifiable remediation actions
- Keep a full timeline + postmortem trail for operators
- Learn from previous incidents and improve suggestions over time
Non-goals
- Replacing all SRE judgement in high-risk production scenarios
- Fully autonomous execution of high-blast-radius actions
- Perfect diagnosis certainty in every ambiguous failure mode
Constraints that shaped the design
- Incidents are multi-signal and noisy; false positives are costly
- Single-agent architectures become bottlenecks and are hard to reason about
- Purely synchronous pipelines couple components too tightly
- "Auto-fix" systems without policy gates are dangerous
This drove me toward a multi-agent, event-driven architecture.
2. High-Level Architecture
At a high level there are three planes:
- Target plane: application microservices + workers + datastores
- Control plane: observer/diagnoser/remediator/safety/orchestrator/learner
- Operator plane: dashboard API + WebSocket UI + approval workflow
graph TD
A[Microservices + Workers] --> O[Observer Pool]
O --> NATS[NATS JetStream]
NATS --> D[Diagnoser]
D --> R[Remediator]
R --> S[Safety]
S --> ORCH[Orchestrator FSM]
ORCH --> DB[(PostgreSQL incidents)]
ORCH --> UI[Dashboard API + Frontend]
S --> H[Human Approvals]
ORCH --> L[Learner]Service environment monitored by the swarm
- Gateway:
api-gateway(:8000) - Core services:
user-svc(:8001),order-svc(:8002),auth-svc(:8004),payment-svc(:8005) - Extended services:
product-svc(:8003),search-svc(:8006) - Workers:
notification-worker(:8007),inventory-worker(:8008),analytics-worker(:8009)
Observability stack:
- Prometheus (
:9090) - Grafana (
:3000) - Loki (
:3100) - Tempo (
:3200) - AlertManager (
:9093)
Infrastructure:
- PostgreSQL (incident + service data)
- Redis (cache/session + counters)
- Elasticsearch (search workload)
- NATS JetStream (control-plane messaging backbone)
3. Why Event-Driven Messaging (NATS JetStream)
I did not want synchronous RPC chains like:
Observer -> Diagnoser -> Remediator -> Safety -> Orchestrator
That pattern is fragile under partial failures and creates tight coupling between agent runtimes.
Instead, each agent publishes and consumes typed messages on subjects. The envelope uses correlation IDs to keep incident context stitched end-to-end.
Core message envelope (conceptual)
{
"message_id": "uuid",
"correlation_id": "incident_id",
"trace_id": "distributed_trace_key",
"source_agent": "agents.observer",
"target_agent": "agents.diagnoser",
"message_type": "anomaly_detected",
"priority": 1,
"ttl_seconds": 120,
"timestamp": "2026-03-08T10:00:00Z",
"payload": {},
"context": {}
}Important subjects
agents.observer.anomaliesagents.diagnoser.requestsagents.diagnoser.resultsagents.safety.reviewsagents.safety.decisionsagents.remediator.executionsincidents.lifecyclehuman.approvalshuman.approvals.responses
Streams
AGENTS: agent traffic + heartbeatINCIDENTS: lifecycle transitionsHUMAN: approval requests/responsesBUSINESS: domain events (orders.created,payments.failed, etc.)
The key operational benefit: each agent can fail, restart, or scale independently without collapsing the full incident pipeline.
4. Agent Roles and Contracts
The swarm has six specialized roles.
4.1 Observer
Observer is actually a pool of detectors:
- metrics observer
- log observer
- health observer
- synthetic prober
Main techniques:
- Dynamic z-score anomaly detection over sliding windows
- Static threshold checks for known hard limits
- Deduplication to suppress alert storms
- Trend prediction for early warning (
trend_breach_predicted)
Representative detector behavior:
# dynamic mode: compare new value to prior baseline
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)
z_score = (value - mean) / stdev
if abs(z_score) >= threshold:
emit_anomaly()This avoids hardcoding every signal to a static threshold and adapts to changing baselines.
4.2 Diagnoser
Diagnoser transforms raw anomalies into structured RCA hypotheses.
Pipeline:
- Correlate related anomalies into an incident context
- Gather supporting context from metrics/logs/dependencies
- Generate root-cause hypotheses
- Run debate/scoring if confidence is weak
- Publish diagnosis + confidence
LLM strategy is layered:
- Primary: Gemini backend
- Fallback: OpenAI backend
- Last fallback: deterministic heuristics
That fallback chain is practical: diagnosis never hard-fails because one external model backend is down.
4.3 Remediator
Remediator maps diagnosis to runbooks and executes actions.
Core components:
runbook_engine.pyaction_executor.pyverification_engine.pyrollback_manager.py
Runbooks are YAML and matched by root-cause category + minimum confidence.
id: runbook_memory_leak
matches:
root_cause_category: "memory_leak"
confidence_minimum: 50
actions:
- type: "container_restart"
params:
target: "{{diagnosis.root_cause.service}}"
risk: "low"
approval_required: falseParameter templating lets one runbook apply to many services while still remaining explicit.
4.4 Safety
Safety is the most important guardrail in the system.
It decides whether a proposed remediation is:
approvedrejectedpending_human_approval
Current safety stack:
- Policy engine (hard allow/deny logic)
- Blast radius calculation
- Rate limiting / anti-loop controls
- Approval gateway for operator handoff
Representative policy checks:
- High-risk actions require human approval
- Explicit
approval_required: trueforces manual gate - Certain actions against critical targets are blocked
- Scale operations are capped by replica limit
This turns "autonomous remediation" into bounded autonomy.
4.5 Orchestrator
Orchestrator is the control-plane brain:
- owns lifecycle state transitions
- routes work between agents
- handles retries/timeouts/escalation
- records incident timeline and postmortem payloads
It subscribes to events from observer, diagnoser, safety, remediator, heartbeat, and human approvals.
4.6 Learner
Learner captures incident outcomes and creates long-term memory:
- vectorizes incidents into ChromaDB
- retrieves similar incident patterns
- tracks runbook success and MTTR outcomes
This closes the loop from incident response to incident intelligence.
5. Formal Incident Lifecycle (FSM)
I used an explicit finite state machine instead of ad-hoc status strings.
States:
detectingdiagnosingproposing_remediationsafety_reviewexecutingverifyingresolvedclosed
Allowed transitions are strict (for example, no direct jump from diagnosing to executing).
TRANSITIONS = {
"detecting": {"diagnosing"},
"diagnosing": {"proposing_remediation"},
"proposing_remediation": {"safety_review"},
"safety_review": {"executing", "proposing_remediation"},
"executing": {"verifying", "proposing_remediation"},
"verifying": {"resolved", "executing"},
"resolved": {"closed"},
}Timeouts are state-specific (for example diagnosing: 180s, safety_review: 300s).
Retries are bounded (MAX_RETRIES_PER_STATE = 2).
This gives deterministic behavior under failure and simplifies testing.
6. End-to-End Control Flow
A typical incident journey:
- Observer emits anomaly on
agents.observer.anomalies - Orchestrator creates incident record and transitions to
diagnosing - Diagnoser publishes RCA and confidence
- Orchestrator transitions to
proposing_remediation - Remediator proposes runbook action
- Safety approves/rejects/escalates
- If approved, remediator executes + verifies
- Orchestrator resolves incident or loops with retries
- Timeline and postmortem are persisted
Human approval path
sequenceDiagram
participant R as Remediator
participant S as Safety
participant H as Human Queue
participant D as Dashboard Operator
participant O as Orchestrator
R->>S: action proposal
S->>H: human.approvals (pending_human_approval)
D->>H: approve/reject via dashboard
H->>O: human.approvals.responses
O->>R: proceed or retryThis keeps operators in control for high-impact actions without blocking low-risk auto-remediations.
7. Data Model and Auditability
Incident state is persisted in PostgreSQL with key fields like:
- status/state
- severity
- diagnosis + confidence
- root cause service/category
- remediation actions
- timeline events
- postmortem payload
Why this matters:
- exact replayability of incident progression
- clear audit trail for automation decisions
- easier post-incident reviews and runbook improvements
The timeline builder appends structured events at major transitions (anomaly_detected, diagnosis_complete, action_executed, verification_passed, resolved).
8. Dashboard and Operator Experience
I built a dedicated dashboard plane:
- FastAPI backend (
dashboard/api) - React frontend (
dashboard/frontend) - WebSocket manager for live updates
UI surfaces:
- Active incidents and state
- Agent health and heartbeats
- Approval queue for pending actions
- Timeline view for each incident
The dashboard is not just observability UI; it is part of the control path for human decisions.
9. Chaos Engineering and Scoring
To avoid "works in demo" syndrome, I added a chaos runner that injects failures and measures the swarm response.
Scenarios currently include:
memory_leakcpu_spikenetwork_partitiondb_overload
Runner behavior:
- Inject scenario
- Poll incident DB
- Compute MTTD and MTTR
- Cleanup
- Emit markdown report
Scoring rubric (letter grade):
- A: fast detection + fast remediation
- B: acceptable response
- C: detected but slow
- F: undetected or timeout
This makes reliability progress measurable, not anecdotal.
10. Build and Operations Surface
The Makefile became the operational contract of the project:
make infra-up # postgres/redis/nats
make init-nats # bootstrap streams
make obs-up # prometheus/grafana/loki/tempo/alertmanager
make up # microservices playground
make agents-up # agent swarm
make dashboard-up # dashboard api + frontend
make test # unit and service tests
make health # endpoint checksThis reduced setup friction significantly for local reproducibility.
11. Testing Strategy
I split tests by concern:
- Unit tests for core modules (detector, FSM, policy, routing)
- Agent integration tests around orchestrator flow
- Service-level tests for microservices
- Chaos runs for system behavior under fault
Notable high-leverage test targets:
- FSM invalid transition prevention
- Retry/timeout escalation behavior
- Safety policy deny paths
- Runbook matching by category/confidence
The major gain from this structure is confidence when refactoring agent interactions.
12. Key Design Tradeoffs
Multi-agent vs single intelligent agent
- Multi-agent increases coordination overhead
- But gives clean specialization and lower cognitive coupling
- Failures are more isolatable and testable
YAML runbooks vs fully generated actions
- YAML is less "smart" than free-form generation
- But far safer and auditable
- Easier for SRE teams to review and diff
Strict FSM vs flexible workflow
- FSM can feel rigid initially
- But massively helps with correctness, retries, and incident replay
Human gate for high-risk actions
- Slower than full automation
- Necessary to avoid high-blast-radius mistakes
13. What I Learned
Three practical lessons stood out:
-
Safety architecture is as important as diagnosis quality. A good RCA engine without policy controls is still unsafe in practice.
-
Correlation IDs are non-negotiable in event-driven systems. Without them, debugging and timeline assembly quickly become unmanageable.
-
Incident systems need lifecycle rigor, not just model intelligence. Formal state transitions did more for reliability than adding model complexity.
14. What I Would Improve Next
Near-term roadmap:
- richer blast-radius modeling using service dependency graphs
- stronger runbook verification with canary-style post-checks
- tighter learner feedback loop into diagnosis confidence weighting
- deeper SLO/SLA aware remediation prioritization
- better replay tooling for incident simulation from recorded timelines
15. Closing
SRE Agent Swarm is my attempt to treat incident response as a systems problem, not just a prompt-engineering problem.
The core idea is simple: combine distributed systems discipline (state machines, message contracts, idempotent flows, retries) with AI-assisted diagnosis, then constrain execution through hard safety boundaries.
That combination made the platform both useful and operable.
If you are building autonomous infrastructure tooling, my strongest recommendation is:
- start with message contracts,
- define your lifecycle state machine early,
- and design safety before you design autonomy.
Everything else compounds from there.