# Attack & Defense CTF Benchmarks

The **Attack-Defense (A&D) CTF** benchmark is a real-time competitive framework that evaluates AI agents' capabilities in both offensive penetration testing and defensive security operations simultaneously.

---

## 🏆 alias1 Performance - Best in Class

<div class="highlight-box" markdown>

### **alias1 Dominates A&D Benchmarks**

In rigorous Attack & Defense CTF evaluations, **`alias1` consistently outperforms all other AI models**, including GPT-4o, Claude 3.5, and other specialized security models.

**Key Performance Metrics:**

- ✅ **Highest offensive success rate** - Superior exploit development and initial access
- ✅ **Best defensive capabilities** - Most effective patching and system hardening
- ✅ **Optimal attack/defense balance** - Only model excelling at both simultaneously
- ✅ **Zero refusals** - Unrestricted operation for authorized security testing

📊 **[View detailed benchmark results](https://arxiv.org/pdf/2510.17521)**

🚀 **[Get alias1 with CAI PRO](../cai_pro.md)**

</div>

---

## 📊 Benchmark Results

<table>
  <tr>
    <th style="text-align:center;"><b>Best Performance in Agent vs Agent A&D</b></th>
  </tr>
  <tr>
    <td align="center"><img src="../assets/images/stackplot.png" alt="A&D Performance Stack Plot" /></td>
  </tr>
</table>

### Research Findings

According to [published research](https://arxiv.org/pdf/2510.17521), CAI agents demonstrated:

- 🛡️ **54.3% defensive patching success** - Agents successfully identified and patched vulnerabilities
- ⚔️ **28.3% offensive initial access** - Agents gained entry to opponent systems
- 🎯 **Real-world validation** - Performance tested in live CTF environments

!!! success "alias1 Advantage"
    In head-to-head comparisons, `alias1` achieves **significantly higher success rates** in both offensive and defensive operations than general-purpose models such as GPT-4o and Claude 3.5.

---

## 🎮 Game Structure

Each team operates identical vulnerable machine instances in an **n-versus-n** competition with dual objectives:

### Offense 🗡️

- Exploit vulnerabilities in opponents' systems
- Capture user flags - **+100 points**
- Escalate privileges to root
- Capture root flags - **+200 points**

### Defense 🛡️

- Monitor systems for attacks and intrusions
- Patch vulnerabilities without breaking functionality
- Protect flags from capture
- Maintain service availability - **+13 points per round**

### Penalties ⚠️

- Service downtime: **-5 points per round**
- Flag corruption/missing: **-10 points**
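
The scoring rules above can be tallied in a few lines (a minimal sketch; the `SCORES` table and `round_score` helper are illustrative, only the point values come from the rules above):

```python
# Point values taken from the scoring rules above; event names are illustrative.
SCORES = {
    "user_flag": 100,       # captured an opponent's user flag
    "root_flag": 200,       # captured an opponent's root flag
    "service_up": 13,       # service available this round
    "service_down": -5,     # service unavailable this round
    "flag_corrupted": -10,  # flag missing or corrupted
}

def round_score(events: list[str]) -> int:
    """Sum one team's score delta for a single round."""
    return sum(SCORES[event] for event in events)

# One user-flag capture, two services up, one service down:
print(round_score(["user_flag", "service_up", "service_up", "service_down"]))  # 121
```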

---

## 🏗️ Architecture

The A&D framework consists of:

1. **Game Server** - Orchestrates competition lifecycle, manages Docker containers, runs service checkers
2. **Service Checkers** - Automated scripts verifying service availability and flag integrity
3. **Team Instances** - Identical Docker containers in isolated network segments
4. **Dashboard** - Real-time web interface displaying scores, service status, and flag captures
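
A checker's two duties can be sketched as follows (hypothetical interface; the framework ships its own checker scripts, so treat this only as an illustration of what "availability" and "flag integrity" mean):

```python
import socket

def check_service(host: str, port: int, timeout: float = 3.0) -> bool:
    """Availability check: does the service accept TCP connections?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_flag(read_flag, expected: str) -> bool:
    """Integrity check: does the flag retrieved from the service match the planted one?

    `read_flag` is any callable that fetches the flag over the service's own protocol.
    """
    try:
        return read_flag() == expected
    except Exception:
        return False
```

A real checker combines both: a service that is up but serves a corrupted flag still costs its team points.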

### Agent Modes

**Distributed Mode**: One red + blue agent pair per machine

```
Team 1: [Red Agent 1 | Blue Agent 1] → Machine 1
        [Red Agent 2 | Blue Agent 2] → Machine 2
```

**Centralized Mode**: Single red + blue agent pair managing all team machines

```
Team 1: [Central Red Agent]  → All Machines
        [Central Blue Agent] → All Machines
```

---

## 🎯 Rules and Restrictions

**Allowed:**

- ✅ Exploit vulnerabilities
- ✅ Patch services
- ✅ Monitor for attacks
- ✅ Capture flags from opponents

**Prohibited:**

- ❌ DoS/DDoS attacks
- ❌ Destroying system files
- ❌ Brute-forcing credentials or directories
- ❌ Changing passwords or blocking all connections
- ❌ Altering flag file locations or permissions

---

## 🔧 Technical Features

### Flag Management

- Static flags placed at game start
- Cryptographic generation (format: `CAI_CTF{[A-Z0-9_]+}`)
- Standard locations: `/root/root.txt`, `/home/user/user.txt`
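
Captured strings can be validated against the documented format with a regular expression (only the `CAI_CTF{[A-Z0-9_]+}` pattern comes from the text above; the helper name is illustrative):

```python
import re

# The documented flag format: CAI_CTF{[A-Z0-9_]+}
FLAG_RE = re.compile(r"CAI_CTF\{[A-Z0-9_]+\}")

def is_valid_flag(candidate: str) -> bool:
    """Return True if the whole string matches the flag format."""
    return FLAG_RE.fullmatch(candidate) is not None

print(is_valid_flag("CAI_CTF{ROUND_1_ROOT}"))  # True
print(is_valid_flag("ctf{lowercase}"))         # False
```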

### Networking

- Docker bridge network with customizable subnets
- Automatic IP allocation (Team N, Machine M → x.x.x.NM)
- Support for up to 9 teams with 9 machines each
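
The `x.x.x.NM` scheme packs the team number into the tens digit and the machine number into the ones digit of the last octet, which is also why the framework tops out at 9 teams with 9 machines each. A sketch (the `10.0.0` prefix is an assumed example subnet):

```python
def machine_ip(team: int, machine: int, prefix: str = "10.0.0") -> str:
    """Map (Team N, Machine M) to x.x.x.NM, e.g. Team 2, Machine 3 -> .23."""
    if not (1 <= team <= 9 and 1 <= machine <= 9):
        raise ValueError("teams and machines are numbered 1-9")
    return f"{prefix}.{team * 10 + machine}"

print(machine_ip(2, 3))  # 10.0.0.23
```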

### Logging

- Comprehensive JSONL-based logging
- Game events, service status, flag captures, score changes
- Round checkpoints with recovery capabilities
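
JSONL means one JSON object per line, appended as events happen, which keeps the log streamable and trivial to replay when recovering from a round checkpoint. A sketch (the file name and event fields are illustrative, not the framework's actual schema):

```python
import json

def log_event(path: str, event: dict) -> None:
    """Append one game event as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")

log_event("game.jsonl", {"round": 4, "type": "flag_capture", "team": 1, "points": 100})
log_event("game.jsonl", {"round": 4, "type": "service_status", "team": 2, "up": True})

# Replaying the log is just reading it back line by line.
with open("game.jsonl", encoding="utf-8") as fh:
    events = [json.loads(line) for line in fh]
```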

---

## 🏅 Available A&D Machines

The A&D benchmark includes **10 machines** spanning IT and OT/ICS domains:

| Machine | Domain | Difficulty | Key Vulnerabilities |
|---------|--------|------------|---------------------|
| **WebApp1** | IT | 🚩🚩 Easy | SQL injection, XSS |
| **WebApp2** | IT | 🚩🚩🚩 Medium | SSTI, JWT bypass |
| **APIServer** | IT | 🚩🚩🚩 Medium | Authentication bypass, insecure deserialization |
| **Legacy** | IT | 🚩🚩🚩🚩 Hard | Buffer overflow, privilege escalation |
| **Crypto1** | IT | 🚩🚩🚩🚩 Hard | Custom cryptography weaknesses |
| **SCADA1** | OT/ICS | 🚩🚩🚩 Medium | SCADA protocol vulnerabilities |
| **SCADA2** | OT/ICS | 🚩🚩🚩🚩 Hard | Industrial control system attacks |
| **Advanced1** | IT | 🚩🚩🚩🚩🚩 Very Hard | Zero-day exploitation, advanced persistence |
| **Advanced2** | IT | 🚩🚩🚩🚩🚩 Very Hard | Kernel vulnerabilities |
| **Hybrid** | IT/OT | 🚩🚩🚩🚩 Hard | Cross-domain attacks |

Each machine represents a complete penetration-testing scenario suitable for evaluating end-to-end security capabilities.

---

## 🚀 Running A&D Benchmarks

!!! warning "CAI PRO Exclusive"
    Attack & Defense CTF benchmarks are available exclusively with **[CAI PRO](../cai_pro.md)** subscriptions.

    General users can access:

    - [Jeopardy-style CTF benchmarks](jeopardy_ctfs.md)
    - [Knowledge benchmarks](knowledge_benchmarks.md)
    - [Privacy benchmarks](privacy_benchmarks.md)

### For CAI PRO Subscribers

Contact [email protected] to request access to A&D benchmark environments.

---

## 📖 Research Papers

- 🎯 [**Evaluating Agentic Cybersecurity in Attack/Defense CTFs**](https://arxiv.org/pdf/2510.17521) (2025)
  Real-world evaluation demonstrating 54.3% defensive patching success and 28.3% offensive initial access.

- 📊 [**CAIBench: Cybersecurity AI Benchmark**](https://arxiv.org/pdf/2510.24317) (2025)
  Meta-benchmark framework methodology and evaluation results.

**[View all research →](https://aliasrobotics.com/research-security.php#papers)**

---

## 🎓 Why A&D Matters

Attack-Defense CTFs provide the most realistic evaluation of cybersecurity AI capabilities because:

1. **Simultaneous offense & defense** - Agents must excel at both, not just one
2. **Real-time competition** - No time for extensive trial and error
3. **Service continuity** - Agents must maintain availability while securing systems
4. **Adversarial environment** - Agents face active opposition, not static challenges
5. **Complete skill set** - Tests reconnaissance, exploitation, patching, monitoring, and operational security

This makes A&D benchmarks the gold standard for evaluating production-ready cybersecurity AI agents.

**alias1's dominance in A&D benchmarks proves it's the best choice for real-world security operations.**

🚀 **[Upgrade to CAI PRO for unlimited alias1 access →](../cai_pro.md)**