Commit 30eb9b9

sjarmak and claude committed
docs: update all documentation to canonical 370-task benchmark setup

- Update suite tables across 14 docs to reflect Neyman-optimal allocation (150 SDLC + 220 Org = 370 tasks, variable suite sizes)
- Replace all references to old separate selection file (selected_mcp_unique_tasks.json -> unified selected_benchmark_tasks.json)
- Update config pairing: Org tasks now use baseline-local-direct + mcp-remote-direct (not artifact configs)
- Delete stale docs/WHITE_PAPER_REPORT_V2.md (duplicate of technical report)
- Rewrite BLOG_POST.md as "Part II: CodeScaleBench" with V2 multi-run data
- Update TECHNICAL_REPORT_V2.md: taxonomy tables, new sections 11.7-11.12 (language, difficulty, codebase size, MCP tools, cost, timing)
- Fix stale counts in DAYTONA.md, CONFIGS.md, TASK_CATALOG.md, REPORT_CONTEXT.md, LEADERBOARD.md, RESULT_DIRECTORY_SPEC.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 92b4eea commit 30eb9b9

14 files changed: +629 -1797 lines changed

README.md

Lines changed: 58 additions & 62 deletions
@@ -70,37 +70,37 @@ Nine suites organized by software development lifecycle phase:

 | Suite | SDLC Phase | Tasks | Description |
 |-------|-----------|------:|-------------|
-| `csb_sdlc_understand` | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
-| `csb_sdlc_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
-| `csb_sdlc_fix` | Bug Repair | 20 | Diagnosing and fixing real bugs across production codebases |
-| `csb_sdlc_feature` | Feature Implementation | 20 | New features, interface implementation, big-code features |
-| `csb_sdlc_refactor` | Cross-File Refactoring | 20 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
-| `csb_sdlc_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
-| `csb_sdlc_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
-| `csb_sdlc_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
-| `csb_sdlc_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **180** | |
+| `csb_sdlc_fix` | Bug Repair | 26 | Diagnosing and fixing real bugs across production codebases |
+| `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
+| `csb_sdlc_debug` | Debugging & Investigation | 18 | Root cause tracing, fault localization, provenance |
+| `csb_sdlc_test` | Testing & QA | 18 | Code review, performance testing, code search validation, test generation |
+| `csb_sdlc_refactor` | Cross-File Refactoring | 16 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
+| `csb_sdlc_design` | Architecture & Design | 14 | Architecture analysis, dependency graphs, change impact |
+| `csb_sdlc_document` | Documentation | 13 | API references, architecture docs, migration guides, runbooks |
+| `csb_sdlc_secure` | Security & Compliance | 12 | CVE analysis, reachability, governance, access control |
+| `csb_sdlc_understand` | Requirements & Discovery | 10 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
+| **Total** | | **150** | |

 ## CodeScaleBench-Org

 Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.

 | Suite | Category | Tasks | Description |
 |-------|----------|------:|-------------|
-| `csb_org_crossrepo_tracing` | A: Dependency Tracing | 20 | Cross-repo dependency chains, blast radius, symbol resolution |
-| `csb_org_security` | B: Vulnerability Remediation | 20 | CVE mapping, missing auth middleware across repos |
-| `csb_org_migration` | C: Framework Migration | 20 | API migrations, breaking changes across repos |
-| `csb_org_incident` | D: Incident Debugging | 20 | Error-to-code-path tracing across microservices |
-| `csb_org_onboarding` | E: Onboarding & Comprehension | 20 | API consumption mapping, end-to-end flow, architecture maps |
-| `csb_org_compliance` | F: Compliance | 20 | Standards adherence, audit, and provenance workflows |
-| `csb_org_crossorg` | G: Cross-Org Discovery | 20 | Interface implementations and authoritative repo identification across orgs |
+| `csb_org_onboarding` | E: Onboarding & Comprehension | 28 | API consumption mapping, end-to-end flow, architecture maps |
+| `csb_org_migration` | C: Framework Migration | 26 | API migrations, breaking changes across repos |
+| `csb_org_security` | B: Vulnerability Remediation | 24 | CVE mapping, missing auth middleware across repos |
+| `csb_org_crossrepo_tracing` | A: Dependency Tracing | 22 | Cross-repo dependency chains, blast radius, symbol resolution |
 | `csb_org_domain` | H: Domain Lineage | 20 | Config propagation, architecture patterns, domain analysis |
-| `csb_org_org` | I: Organizational Context | 20 | Agentic discovery, org-wide coding correctness |
-| `csb_org_platform` | J: Platform Knowledge | 20 | Service template discovery and tribal knowledge |
-| `csb_org_crossrepo` | K: Cross-Repo Discovery | 20 | Cross-repo search, dependency discovery, impact analysis |
+| `csb_org_incident` | D: Incident Debugging | 20 | Error-to-code-path tracing across microservices |
+| `csb_org_compliance` | F: Compliance | 18 | Standards adherence, audit, and provenance workflows |
+| `csb_org_platform` | J: Platform Knowledge | 18 | Service template discovery and tribal knowledge |
+| `csb_org_crossorg` | G: Cross-Org Discovery | 15 | Interface implementations and authoritative repo identification across orgs |
+| `csb_org_org` | I: Organizational Context | 15 | Agentic discovery, org-wide coding correctness |
+| `csb_org_crossrepo` | K: Cross-Repo Discovery | 14 | Cross-repo search, dependency discovery, impact analysis |
 | **Total** | | **220** | |

-**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 Org across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
+**Combined canonical benchmark: 370 tasks** (150 SDLC across 9 suites + 220 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform 20-task sizing. An additional 28 backup tasks are archived in `benchmarks/backups/`.

 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.

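The hunk above says suite sizes come from Neyman-optimal allocation but does not show the calculation, the candidate pool sizes, or the per-suite variances. As a rough illustration only, the textbook form allocates a fixed budget in proportion to each stratum's population size times its standard deviation. The sketch below assumes that form plus largest-remainder rounding; the function name, the stratum names, and every number in `example` are hypothetical and are not taken from the benchmark.

```python
from math import floor


def neyman_allocation(total, strata):
    """Split `total` samples across strata in proportion to N_h * S_h.

    `strata` maps a stratum name to (population_size, std_dev).
    Rounding uses the largest-remainder method so the sizes sum to `total`.
    """
    weights = {name: n * s for name, (n, s) in strata.items()}
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one stratum needs a positive N_h * S_h")

    exact = {name: total * w / weight_sum for name, w in weights.items()}
    sizes = {name: floor(x) for name, x in exact.items()}

    # Hand the units lost to flooring to the largest fractional parts.
    leftover = total - sum(sizes.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - sizes[k], reverse=True)
    for name in by_remainder[:leftover]:
        sizes[name] += 1
    return sizes


# Hypothetical inputs, NOT the benchmark's real pool sizes or variances.
example = {
    "suite_a": (40, 0.5),   # large pool, high outcome variance
    "suite_b": (40, 0.25),  # large pool, low outcome variance
    "suite_c": (20, 0.5),   # small pool, high outcome variance
}
sizes = neyman_allocation(60, example)
print(sizes)  # {'suite_a': 30, 'suite_b': 15, 'suite_c': 15}
```

With these made-up inputs, the high-variance suite receives twice the tasks of the equally sized low-variance suite, which is the kind of non-uniform sizing the new tables show.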

@@ -113,7 +113,7 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
 All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:

 - **SDLC suites** (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
-- **Org suites** (`csb_org_*`): `baseline-local-artifact` + `mcp-remote-artifact`
+- **Org suites** (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct` (some legacy runs used `baseline-local-artifact` + `mcp-remote-artifact`)

 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.

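The pairing rule in this hunk can be written as a small lookup. The config names below are copied from the README text; the helper `config_pair` and its `legacy` switch are hypothetical and are not part of the repository's scripts. This is only a sketch of how analysis code might resolve a suite name to its two run configs.

```python
# Config names come from the README; everything else here is illustrative.
CURRENT_PAIR = ("baseline-local-direct", "mcp-remote-direct")
LEGACY_ORG_PAIR = ("baseline-local-artifact", "mcp-remote-artifact")


def config_pair(suite, legacy=False):
    """Return (baseline_config, mcp_config) for a suite name.

    `legacy=True` selects the artifact configs that some older Org runs used.
    """
    if suite.startswith("csb_sdlc_"):
        return CURRENT_PAIR
    if suite.startswith("csb_org_"):
        return LEGACY_ORG_PAIR if legacy else CURRENT_PAIR
    raise ValueError(f"unknown suite prefix: {suite!r}")


print(config_pair("csb_org_security"))
print(config_pair("csb_org_security", legacy=True))
```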

@@ -132,27 +132,27 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con

 ```
 benchmarks/ # Task definitions organized by SDLC phase + Org
-csb_sdlc_feature/ # Feature Implementation (20 tasks)
-csb_sdlc_refactor/ # Cross-File Refactoring (20 tasks)
-csb_sdlc_debug/ # Debugging & Investigation (20 tasks)
-csb_sdlc_design/ # Architecture & Design (20 tasks)
-csb_sdlc_document/ # Documentation (20 tasks)
-csb_sdlc_fix/ # Bug Repair (20 tasks)
-csb_sdlc_secure/ # Security & Compliance (20 tasks)
-csb_sdlc_test/ # Testing & QA (20 tasks)
-csb_sdlc_understand/ # Requirements & Discovery (20 tasks)
+csb_sdlc_fix/ # Bug Repair (26 tasks)
+csb_sdlc_feature/ # Feature Implementation (23 tasks)
+csb_sdlc_debug/ # Debugging & Investigation (18 tasks)
+csb_sdlc_test/ # Testing & QA (18 tasks)
+csb_sdlc_refactor/ # Cross-File Refactoring (16 tasks)
+csb_sdlc_design/ # Architecture & Design (14 tasks)
+csb_sdlc_document/ # Documentation (13 tasks)
+csb_sdlc_secure/ # Security & Compliance (12 tasks)
+csb_sdlc_understand/ # Requirements & Discovery (10 tasks)
 backups/ # Archived backup tasks (28 total)
-csb_org_compliance/ # Org: compliance & audit (20 tasks)
-csb_org_crossorg/ # Org: cross-org discovery (20 tasks)
-csb_org_crossrepo/ # Org: cross-repo discovery (20 tasks)
-csb_org_crossrepo_tracing/ # Org: dependency tracing (20 tasks)
+csb_org_onboarding/ # Org: onboarding (28 tasks)
+csb_org_migration/ # Org: framework migration (26 tasks)
+csb_org_security/ # Org: vulnerability remediation (24 tasks)
+csb_org_crossrepo_tracing/ # Org: dependency tracing (22 tasks)
 csb_org_domain/ # Org: domain lineage (20 tasks)
 csb_org_incident/ # Org: incident debugging (20 tasks)
-csb_org_migration/ # Org: framework migration (20 tasks)
-csb_org_onboarding/ # Org: onboarding (20 tasks)
-csb_org_org/ # Org: org context (20 tasks)
-csb_org_platform/ # Org: platform knowledge (20 tasks)
-csb_org_security/ # Org: vulnerability remediation (20 tasks)
+csb_org_compliance/ # Org: compliance & audit (18 tasks)
+csb_org_platform/ # Org: platform knowledge (18 tasks)
+csb_org_crossorg/ # Org: cross-org discovery (15 tasks)
+csb_org_org/ # Org: org context (15 tasks)
+csb_org_crossrepo/ # Org: cross-repo discovery (14 tasks)
 configs/ # Run configs and task selection
 _common.sh # Shared infra: token refresh, parallel execution, multi-account
 sdlc_suite_2config.sh # Generic SDLC runner (used by phase wrappers below)
@@ -166,8 +166,7 @@ configs/ # Run configs and task selection
 test_2config.sh # Phase wrapper: Test (20 tasks)
 run_selected_tasks.sh # Unified runner for all tasks
 validate_one_per_benchmark.sh # Pre-flight smoke (1 task per suite)
-selected_benchmark_tasks.json # Canonical SDLC task selection with metadata
-selected_mcp_unique_tasks.json # Org task selection with metadata
+selected_benchmark_tasks.json # Canonical task selection: 370 tasks (150 SDLC + 220 Org)
 use_case_registry.json # 100 GTM use cases (Org task source)
 archive/ # Pre-SDLC migration scripts (preserved for history)
 scripts/ # Metrics extraction, evaluation, and operational tooling
@@ -285,10 +284,10 @@ This section assumes Harbor is already installed and configured. If not, start w

 ### SDLC Tasks

-The unified runner executes all 180 SDLC tasks across the 2-config matrix:
+The unified runner executes all 370 canonical tasks across the 2-config matrix:

 ```bash
-# Run all 180 SDLC tasks across 2 configs
+# Run all 370 tasks across 2 configs
 bash configs/run_selected_tasks.sh

 # Run only the baseline config
@@ -304,30 +303,27 @@ bash configs/run_selected_tasks.sh --dry-run
 Per-phase runners are also available:

 ```bash
-bash configs/fix_2config.sh # 20 Bug Repair tasks
-bash configs/feature_2config.sh # 20 Feature Implementation tasks
-bash configs/refactor_2config.sh # 20 Cross-File Refactoring tasks
-bash configs/understand_2config.sh # 20 Requirements & Discovery tasks
-bash configs/design_2config.sh # 20 Architecture & Design tasks
-bash configs/debug_2config.sh # 20 Debugging & Investigation tasks
-bash configs/secure_2config.sh # 20 Security & Compliance tasks
-bash configs/test_2config.sh # 20 Testing & QA tasks
-bash configs/document_2config.sh # 20 Documentation tasks
+bash configs/fix_2config.sh # 26 Bug Repair tasks
+bash configs/feature_2config.sh # 23 Feature Implementation tasks
+bash configs/debug_2config.sh # 18 Debugging & Investigation tasks
+bash configs/test_2config.sh # 18 Testing & QA tasks
+bash configs/refactor_2config.sh # 16 Cross-File Refactoring tasks
+bash configs/design_2config.sh # 14 Architecture & Design tasks
+bash configs/document_2config.sh # 13 Documentation tasks
+bash configs/secure_2config.sh # 12 Security & Compliance tasks
+bash configs/understand_2config.sh # 10 Requirements & Discovery tasks
 ```

-### CodeScaleBench-Org Tasks
+### Filtering by Suite

-Org tasks use a separate selection file:
+All tasks (SDLC and Org) are in the unified `selected_benchmark_tasks.json`. Filter by suite with the `--benchmark` flag:

 ```bash
-# Run all Org tasks across 2 configs
-bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json
+# Run only Org security tasks
+bash configs/run_selected_tasks.sh --benchmark csb_org_security

-# Filter by use-case category
-bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --benchmark csb_org_security
-
-# Dry run
-bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --dry-run
+# Run only SDLC fix tasks
+bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix
 ```

 All runners support `--baseline-only`, `--full-only`, `--task TASK_ID`, and `--parallel N` flags.
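
The flags named in this hunk (`--benchmark`, `--baseline-only`, `--full-only`, `--task`, `--parallel`, `--dry-run`) can be assembled programmatically when scripting many runs. The sketch below only builds the argument list and does not execute anything. The helper name is hypothetical, and two things are assumptions rather than documented behavior: that these flags can be freely combined, and that `--baseline-only` and `--full-only` are mutually exclusive.

```python
import shlex

RUNNER = "configs/run_selected_tasks.sh"  # path as shown in the README


def build_runner_command(benchmark=None, task=None, parallel=None,
                         baseline_only=False, full_only=False, dry_run=False):
    """Build the argv list for the unified runner without executing it."""
    # Assumption: the two config filters cannot be combined.
    if baseline_only and full_only:
        raise ValueError("choose at most one of baseline_only / full_only")

    argv = ["bash", RUNNER]
    if benchmark is not None:
        argv += ["--benchmark", benchmark]
    if task is not None:
        argv += ["--task", task]
    if parallel is not None:
        argv += ["--parallel", str(parallel)]
    if baseline_only:
        argv.append("--baseline-only")
    if full_only:
        argv.append("--full-only")
    if dry_run:
        argv.append("--dry-run")
    return argv


cmd = build_runner_command(benchmark="csb_org_security", parallel=4, dry_run=True)
print(shlex.join(cmd))
```

Keeping `--dry-run` on while experimenting is a cheap way to confirm which tasks a filter selects before launching real runs; the resulting list can be passed to `subprocess.run` once the selection looks right.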
