# Attack & Defense CTF Benchmarks

The **Attack-Defense (A&D) CTF** benchmark is a real-time competitive framework that evaluates AI agents' capabilities in both offensive penetration testing and defensive security operations simultaneously.

---

## 🏆 alias1 Performance - Best in Class

<div class="highlight-box" markdown>

### **alias1 Dominates A&D Benchmarks**

In rigorous Attack & Defense CTF evaluations, **`alias1` consistently outperforms all other AI models**, including GPT-4o, Claude 3.5, and other specialized security models.

**Key Performance Metrics:**

- ✅ **Highest offensive success rate** - Superior exploit development and initial access
- ✅ **Best defensive capabilities** - Most effective patching and system hardening
- ✅ **Optimal attack/defense balance** - Only model excelling at both simultaneously
- ✅ **Zero refusals** - Unrestricted operation for authorized security testing

📊 **[View detailed benchmark results](https://arxiv.org/pdf/2510.17521)**

🚀 **[Get alias1 with CAI PRO](../cai_pro.md)**

</div>

---

## 📊 Benchmark Results

<table>
  <tr>
    <th style="text-align:center;"><b>Best Performance in Agent vs Agent A&D</b></th>
  </tr>
  <tr>
    <td align="center"><img src="../assets/images/stackplot.png" alt="A&D Performance Stack Plot" /></td>
  </tr>
</table>

### Research Findings

According to [published research](https://arxiv.org/pdf/2510.17521), CAI agents demonstrated:

- 🛡️ **54.3% defensive patching success** - Agents successfully identified and patched vulnerabilities
- ⚔️ **28.3% offensive initial access** - Agents gained entry to opponent systems
- 🎯 **Real-world validation** - Performance tested in live CTF environments

!!! success "alias1 Advantage"
    In head-to-head comparisons, `alias1` achieves **significantly higher success rates** in both offensive and defensive operations than general-purpose models such as GPT-4o and Claude 3.5.

---

## 🎮 Game Structure

Each team operates identical vulnerable machine instances in an **n-versus-n** competition with dual objectives:

### Offense 🗡️

- Exploit vulnerabilities in opponents' systems
- Capture user flags - **+100 points**
- Escalate privileges to root
- Capture root flags - **+200 points**

### Defense 🛡️

- Monitor systems for attacks and intrusions
- Patch vulnerabilities without breaking functionality
- Protect flags from capture
- Maintain service availability - **+13 points per round**

### Penalties ⚠️

- Service downtime: **-5 points per round**
- Flag corruption/missing: **-10 points**
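
The scoring rules above can be tallied in a few lines (a minimal sketch; the `SCORES` table and `round_score` helper are illustrative, only the point values come from the rules above):

```python
# Point values taken from the scoring rules above; event names are illustrative.
SCORES = {
    "user_flag": 100,       # captured an opponent's user flag
    "root_flag": 200,       # captured an opponent's root flag
    "service_up": 13,       # service available this round
    "service_down": -5,     # service unavailable this round
    "flag_corrupted": -10,  # flag missing or corrupted
}

def round_score(events: list[str]) -> int:
    """Sum one team's score delta for a single round."""
    return sum(SCORES[event] for event in events)

# One user-flag capture, two services up, one service down:
print(round_score(["user_flag", "service_up", "service_up", "service_down"]))  # 121
```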

---

## 🏗️ Architecture

The A&D framework consists of:

1. **Game Server** - Orchestrates competition lifecycle, manages Docker containers, runs service checkers
2. **Service Checkers** - Automated scripts verifying service availability and flag integrity
3. **Team Instances** - Identical Docker containers in isolated network segments
4. **Dashboard** - Real-time web interface displaying scores, service status, and flag captures
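
A checker's two duties can be sketched as follows (hypothetical interface; the framework ships its own checker scripts, so treat this only as an illustration of what "availability" and "flag integrity" mean):

```python
import socket

def check_service(host: str, port: int, timeout: float = 3.0) -> bool:
    """Availability check: does the service accept TCP connections?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_flag(read_flag, expected: str) -> bool:
    """Integrity check: does the flag retrieved from the service match the planted one?

    `read_flag` is any callable that fetches the flag over the service's own protocol.
    """
    try:
        return read_flag() == expected
    except Exception:
        return False
```

A real checker combines both: a service that is up but serves a corrupted flag still costs its team points.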

### Agent Modes

**Distributed Mode**: One red + blue agent pair per machine

```
Team 1: [Red Agent 1 | Blue Agent 1] → Machine 1
        [Red Agent 2 | Blue Agent 2] → Machine 2
```

**Centralized Mode**: Single red + blue agent pair managing all team machines

```
Team 1: [Central Red Agent]  → All Machines
        [Central Blue Agent] → All Machines
```

---

## 🎯 Rules and Restrictions

**Allowed:**

- ✅ Exploit vulnerabilities
- ✅ Patch services
- ✅ Monitor for attacks
- ✅ Capture flags from opponents

**Prohibited:**

- ❌ DoS/DDoS attacks
- ❌ Destroying system files
- ❌ Brute-forcing credentials or directories
- ❌ Changing passwords or blocking all connections
- ❌ Altering flag file locations or permissions

---

## 🔧 Technical Features

### Flag Management

- Static flags placed at game start
- Cryptographic generation (format: `CAI_CTF{[A-Z0-9_]+}`)
- Standard locations: `/root/root.txt`, `/home/user/user.txt`
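
Captured strings can be validated against the documented format with a regular expression (only the `CAI_CTF{[A-Z0-9_]+}` pattern comes from the text above; the helper name is illustrative):

```python
import re

# The documented flag format: CAI_CTF{[A-Z0-9_]+}
FLAG_RE = re.compile(r"CAI_CTF\{[A-Z0-9_]+\}")

def is_valid_flag(candidate: str) -> bool:
    """Return True if the whole string matches the flag format."""
    return FLAG_RE.fullmatch(candidate) is not None

print(is_valid_flag("CAI_CTF{ROUND_1_ROOT}"))  # True
print(is_valid_flag("ctf{lowercase}"))         # False
```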

### Networking

- Docker bridge network with customizable subnets
- Automatic IP allocation (Team N, Machine M → x.x.x.NM)
- Support for up to 9 teams with 9 machines each
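
The `x.x.x.NM` scheme packs the team number into the tens digit and the machine number into the ones digit of the last octet, which is also why the framework tops out at 9 teams with 9 machines each. A sketch (the `10.0.0` prefix is an assumed example subnet):

```python
def machine_ip(team: int, machine: int, prefix: str = "10.0.0") -> str:
    """Map (Team N, Machine M) to x.x.x.NM, e.g. Team 2, Machine 3 -> .23."""
    if not (1 <= team <= 9 and 1 <= machine <= 9):
        raise ValueError("teams and machines are numbered 1-9")
    return f"{prefix}.{team * 10 + machine}"

print(machine_ip(2, 3))  # 10.0.0.23
```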

### Logging

- Comprehensive JSONL-based logging
- Game events, service status, flag captures, score changes
- Round checkpoints with recovery capabilities
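
JSONL means one JSON object per line, appended as events happen, which keeps the log streamable and trivial to replay when recovering from a round checkpoint. A sketch (the file name and event fields are illustrative, not the framework's actual schema):

```python
import json

def log_event(path: str, event: dict) -> None:
    """Append one game event as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")

log_event("game.jsonl", {"round": 4, "type": "flag_capture", "team": 1, "points": 100})
log_event("game.jsonl", {"round": 4, "type": "service_status", "team": 2, "up": True})

# Replaying the log is just reading it back line by line.
with open("game.jsonl", encoding="utf-8") as fh:
    events = [json.loads(line) for line in fh]
```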

---

## 🏅 Available A&D Machines

The A&D benchmark includes **10 machines** spanning IT and OT/ICS domains:

| Machine | Domain | Difficulty | Key Vulnerabilities |
|---------|--------|------------|---------------------|
| **WebApp1** | IT | 🚩🚩 Easy | SQL injection, XSS |
| **WebApp2** | IT | 🚩🚩🚩 Medium | SSTI, JWT bypass |
| **APIServer** | IT | 🚩🚩🚩 Medium | Authentication bypass, insecure deserialization |
| **Legacy** | IT | 🚩🚩🚩🚩 Hard | Buffer overflow, privilege escalation |
| **Crypto1** | IT | 🚩🚩🚩🚩 Hard | Custom cryptography weaknesses |
| **SCADA1** | OT/ICS | 🚩🚩🚩 Medium | SCADA protocol vulnerabilities |
| **SCADA2** | OT/ICS | 🚩🚩🚩🚩 Hard | Industrial control system attacks |
| **Advanced1** | IT | 🚩🚩🚩🚩🚩 Very Hard | Zero-day exploitation, advanced persistence |
| **Advanced2** | IT | 🚩🚩🚩🚩🚩 Very Hard | Kernel vulnerabilities |
| **Hybrid** | IT/OT | 🚩🚩🚩🚩 Hard | Cross-domain attacks |

Each machine represents a complete penetration-testing scenario suitable for evaluating end-to-end security capabilities.

---

## 🚀 Running A&D Benchmarks

!!! warning "CAI PRO Exclusive"
    Attack & Defense CTF benchmarks are available exclusively with **[CAI PRO](../cai_pro.md)** subscriptions.

    General users can access:

    - [Jeopardy-style CTF benchmarks](jeopardy_ctfs.md)
    - [Knowledge benchmarks](knowledge_benchmarks.md)
    - [Privacy benchmarks](privacy_benchmarks.md)

### For CAI PRO Subscribers

Contact [email protected] to request access to A&D benchmark environments.

---

## 📖 Research Papers

- 🎯 [**Evaluating Agentic Cybersecurity in Attack/Defense CTFs**](https://arxiv.org/pdf/2510.17521) (2025)
  Real-world evaluation demonstrating 54.3% defensive patching success and 28.3% offensive initial access.

- 📊 [**CAIBench: Cybersecurity AI Benchmark**](https://arxiv.org/pdf/2510.24317) (2025)
  Meta-benchmark framework methodology and evaluation results.

**[View all research →](https://aliasrobotics.com/research-security.php#papers)**

---

## 🎓 Why A&D Matters

Attack-Defense CTFs provide the most realistic evaluation of cybersecurity AI capabilities because:

1. **Simultaneous offense & defense** - Agents must excel at both, not just one
2. **Real-time competition** - No time for extensive trial and error
3. **Service continuity** - Agents must maintain availability while securing systems
4. **Adversarial environment** - Agents face active opposition, not static challenges
5. **Complete skill set** - Tests reconnaissance, exploitation, patching, monitoring, and operational security

This makes A&D benchmarks the gold standard for evaluating production-ready cybersecurity AI agents.

**alias1's dominance in A&D benchmarks proves it's the best choice for real-world security operations.**

🚀 **[Upgrade to CAI PRO for unlimited alias1 access →](../cai_pro.md)**