Agent-Based SRE: Automated Diagnosis and Mitigation in K8s

Repository Structure

SRE-agent/
├── sre-agent/          # Main SRE agent implementation
├── MCP-server/         # Model Context Protocol server for observability
├── notebooks/          # Jupyter notebooks for development and analysis
├── Results/            # Experiment outputs and reports
└── archive/            # Previous project iterations

🤖 SRE Agent

The SRE agent automates incident response in Kubernetes environments through a multi-agent workflow:

graph TD
    Start([Start]) --> Triage[🔍 Triage Agent<br/>Gather observability data]
    Triage -->|Symptoms List| Planner[📋 Planning Agent<br/>Enrich with dependencies<br/>Create RCA tasks]
    Planner -.->|Task 1| RCA1[🔬 RCA Agent 1<br/>Tool budget: 8 calls]
    Planner -.->|Task 2| RCA2[🔬 RCA Agent 2<br/>Tool budget: 8 calls]
    Planner -.->|Task N| RCAN[🔬 RCA Agent N<br/>Tool budget: 8 calls]
    RCA1 -->|Diagnostic Report 1| Supervisor[👔 Supervisor Agent<br/>Correlate findings<br/>Synthesize root cause]
    RCA2 -->|Diagnostic Report 2| Supervisor
    RCAN -->|Diagnostic Report N| Supervisor
    Supervisor --> End([End: Final Root Cause])
    
    style Triage fill:#e1f5ff,stroke:#0066cc,stroke-width:2px,color:#000
    style Planner fill:#fff4e1,stroke:#ff9900,stroke-width:2px,color:#000
    style RCA1 fill:#ffe1f5,stroke:#cc0066,stroke-width:2px,color:#000
    style RCA2 fill:#ffe1f5,stroke:#cc0066,stroke-width:2px,color:#000
    style RCAN fill:#ffe1f5,stroke:#cc0066,stroke-width:2px,color:#000
    style Supervisor fill:#e1ffe1,stroke:#009900,stroke-width:2px,color:#000
    style Start fill:#f0f0f0,stroke:#333,stroke-width:2px,color:#000
    style End fill:#f0f0f0,stroke:#333,stroke-width:2px,color:#000

🔍 Triage Agent: Gathers metrics, logs, and traces from Prometheus, Jaeger, and K8s to quickly identify problematic pods, error traces, slow traces, and anomalous metrics. Analyzes the raw observability data and outputs a list of symptoms with affected resources and evidence.
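To make the "symptoms with affected resources and evidence" output concrete, here is a minimal sketch of what a symptom record and a pod check might look like. The `Symptom` class, its field names, the `triage_pods` function, and the input dictionary shape are all hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class Symptom:
    """Hypothetical symptom record: what is affected and why we think so."""
    resource: str   # affected resource, e.g. a pod name
    kind: str       # illustrative category label
    evidence: str   # short human-readable evidence string


def triage_pods(pods: list[dict]) -> list[Symptom]:
    """Flag pods that are not in the Running phase or that have restarted.

    `pods` is an assumed, simplified shape:
    {"name": str, "phase": str, "restarts": int}
    """
    symptoms = []
    for pod in pods:
        if pod["phase"] != "Running":
            symptoms.append(
                Symptom(pod["name"], "pod_not_running", f"phase={pod['phase']}")
            )
        elif pod.get("restarts", 0) > 0:
            symptoms.append(
                Symptom(pod["name"], "pod_restarts", f"restarts={pod['restarts']}")
            )
    return symptoms


# Usage with made-up pod data
example_pods = [
    {"name": "frontend-1", "phase": "Running", "restarts": 0},
    {"name": "cart-1", "phase": "Pending", "restarts": 0},
    {"name": "db-1", "phase": "Running", "restarts": 4},
]
for symptom in triage_pods(example_pods):
    print(symptom)
```

A real triage step would also cover error traces, slow traces, and anomalous metrics; this sketch covers only the pod-status case.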

📋 Planning Agent: Takes symptoms and enriches them with dependency information (data and infrastructure dependencies). Analyzes correlations and creates a de-duplicated, prioritized list of RCA tasks for parallel investigation. Each task includes investigation goal, target resource, and suggested tools.
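The de-duplication and prioritization step can be sketched as follows. This is an illustration under stated assumptions, not the planner's actual logic: the `RCATask` fields mirror the three items listed above (goal, target resource, suggested tools), while the numeric `priority` field and the "same target and goal means duplicate" rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RCATask:
    """Hypothetical RCA task record."""
    goal: str                 # what the worker should investigate
    target: str               # target resource, e.g. a service or pod
    tools: tuple[str, ...]    # suggested tool names
    priority: int             # assumption: lower number = more urgent


def plan_tasks(candidates: list[RCATask]) -> list[RCATask]:
    """Drop duplicate tasks and return the rest ordered by priority.

    Two tasks count as duplicates when they share target and goal;
    the more urgent copy is kept.
    """
    best: dict[tuple[str, str], RCATask] = {}
    for task in candidates:
        key = (task.target, task.goal)
        if key not in best or task.priority < best[key].priority:
            best[key] = task
    return sorted(best.values(), key=lambda t: t.priority)


# Usage with made-up tasks
tasks = plan_tasks([
    RCATask("check restarts", "cart", ("logs",), priority=2),
    RCATask("check latency", "frontend", ("traces", "metrics"), priority=1),
    RCATask("check restarts", "cart", ("logs", "kubectl"), priority=1),
])
for task in tasks:
    print(task.priority, task.target, task.goal)
```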

🔬 RCA Worker Agents: Execute investigation tasks in parallel. Each worker has a fixed budget of tool calls (8 per worker in the diagram above) and autonomously uses MCP tools (kubectl, logs, traces, metrics) to gather evidence. Workers summarize their findings after each step and submit a diagnostic report with diagnosis, reasoning, insights, and tool usage stats.
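The combination of a per-worker tool budget and parallel execution can be sketched with the standard library. The budget of 8 comes from the diagram; everything else here (the task dictionary shape, the `tools` mapping of plain callables, and the function names) is a hypothetical stand-in for the real MCP tool calls and LLM-driven tool selection.

```python
from concurrent.futures import ThreadPoolExecutor

TOOL_BUDGET = 8  # per-worker budget shown in the diagram


def run_worker(task: dict, tools: dict, budget: int = TOOL_BUDGET) -> dict:
    """Run one investigation, stopping once the tool budget is spent.

    `task` is an assumed shape: {"goal": str, "target": str, "tools": [str]}.
    `tools` maps a tool name to a callable taking the target resource.
    """
    findings = []
    calls = 0
    for tool_name in task["tools"]:
        if calls >= budget:
            break  # budget exhausted: report what we have so far
        result = tools[tool_name](task["target"])
        calls += 1
        findings.append(f"{tool_name}: {result}")
    return {"task": task["goal"], "findings": findings, "tool_calls": calls}


def run_workers(tasks: list[dict], tools: dict) -> list[dict]:
    """Run all workers concurrently; reports come back in task order."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        return list(pool.map(lambda t: run_worker(t, tools), tasks))


# Usage with stub tools in place of real MCP calls
stub_tools = {
    "logs": lambda target: f"no errors in {target}",
    "metrics": lambda target: f"cpu normal on {target}",
}
reports = run_workers(
    [
        {"goal": "check cart", "target": "cart", "tools": ["logs", "metrics"]},
        {"goal": "check frontend", "target": "frontend", "tools": ["logs"]},
    ],
    stub_tools,
)
for report in reports:
    print(report)
```

Threads suit this sketch because tool calls are I/O-bound; an async implementation would be an equally reasonable choice.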

👔 Supervisor Agent: Aggregates all worker reports, correlates findings across investigations, and synthesizes a final root cause diagnosis with affected resources, evidence summary, and investigation overview.
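As a rough illustration of correlating findings across reports, the sketch below uses a simple majority vote over the resource each worker suspects. This is a deliberately simplified stand-in heuristic, not the Supervisor Agent's actual method, and the report keys `suspected_resource` and `diagnosis` are hypothetical.

```python
from collections import Counter


def synthesize(reports: list[dict]) -> dict:
    """Pick the resource most often suspected across worker reports.

    Each report is an assumed shape:
    {"suspected_resource": str, "diagnosis": str}
    """
    if not reports:
        raise ValueError("no worker reports to synthesize")
    counts = Counter(r["suspected_resource"] for r in reports)
    root, votes = counts.most_common(1)[0]
    return {
        "root_cause_resource": root,
        "supporting_reports": votes,
        "total_reports": len(reports),
        # keep only the diagnoses that point at the chosen resource
        "evidence_summary": [
            r["diagnosis"] for r in reports if r["suspected_resource"] == root
        ],
    }


# Usage with made-up worker reports
final = synthesize([
    {"suspected_resource": "cart-db", "diagnosis": "connection refused in logs"},
    {"suspected_resource": "cart-db", "diagnosis": "timeouts in downstream spans"},
    {"suspected_resource": "frontend", "diagnosis": "elevated latency"},
])
print(final)
```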

The system uses the MCP server to interface with observability tools and Kubernetes APIs. Running the RCA workers in parallel at the investigation stage speeds up diagnosis, since independent tasks do not wait on one another. Each run produces a JSON report containing the symptoms, tasks, individual analyses, and final diagnosis.
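A per-run report with the four parts named above might be assembled as in the sketch below. The key names and the contents of each section are assumptions made for illustration; the actual report schema may differ.

```python
import json


def build_report(symptoms, tasks, analyses, final_diagnosis) -> dict:
    """Bundle the four stages of a run into one JSON-serializable dict.

    The key names are hypothetical; they mirror the parts the README lists.
    """
    return {
        "symptoms": symptoms,
        "tasks": tasks,
        "analyses": analyses,
        "final_diagnosis": final_diagnosis,
    }


# Usage with made-up data
report = build_report(
    symptoms=[{"resource": "cart-1", "evidence": "phase=Pending"}],
    tasks=[{"goal": "check cart scheduling", "target": "cart"}],
    analyses=[{"task": "check cart scheduling", "tool_calls": 3}],
    final_diagnosis={"root_cause_resource": "cart"},
)
print(json.dumps(report, indent=2))
```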

martinimarcello00/SRE-agent

Folders and files

Latest commit

History

Repository files navigation

Agent-Based SRE: Automated Diagnosis and Mitigation in K8s

Repository Structure

🤖 SRE Agent

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages