@@ -70,37 +70,37 @@ Nine suites organized by software development lifecycle phase:
 
 | Suite | SDLC Phase | Tasks | Description |
 |-------|-----------|------:|-------------|
-| `csb_sdlc_understand` | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
-| `csb_sdlc_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
-| `csb_sdlc_fix` | Bug Repair | 20 | Diagnosing and fixing real bugs across production codebases |
-| `csb_sdlc_feature` | Feature Implementation | 20 | New features, interface implementation, big-code features |
-| `csb_sdlc_refactor` | Cross-File Refactoring | 20 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
-| `csb_sdlc_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
-| `csb_sdlc_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
-| `csb_sdlc_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
-| `csb_sdlc_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **180** | |
+| `csb_sdlc_fix` | Bug Repair | 26 | Diagnosing and fixing real bugs across production codebases |
+| `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
+| `csb_sdlc_debug` | Debugging & Investigation | 18 | Root cause tracing, fault localization, provenance |
+| `csb_sdlc_test` | Testing & QA | 18 | Code review, performance testing, code search validation, test generation |
+| `csb_sdlc_refactor` | Cross-File Refactoring | 16 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
+| `csb_sdlc_design` | Architecture & Design | 14 | Architecture analysis, dependency graphs, change impact |
+| `csb_sdlc_document` | Documentation | 13 | API references, architecture docs, migration guides, runbooks |
+| `csb_sdlc_secure` | Security & Compliance | 12 | CVE analysis, reachability, governance, access control |
+| `csb_sdlc_understand` | Requirements & Discovery | 10 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
+| **Total** | | **150** | |
 
 ## CodeScaleBench-Org
 
 Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
 
 | Suite | Category | Tasks | Description |
 |-------|----------|------:|-------------|
-| `csb_org_crossrepo_tracing` | A: Dependency Tracing | 20 | Cross-repo dependency chains, blast radius, symbol resolution |
-| `csb_org_security` | B: Vulnerability Remediation | 20 | CVE mapping, missing auth middleware across repos |
-| `csb_org_migration` | C: Framework Migration | 20 | API migrations, breaking changes across repos |
-| `csb_org_incident` | D: Incident Debugging | 20 | Error-to-code-path tracing across microservices |
-| `csb_org_onboarding` | E: Onboarding & Comprehension | 20 | API consumption mapping, end-to-end flow, architecture maps |
-| `csb_org_compliance` | F: Compliance | 20 | Standards adherence, audit, and provenance workflows |
-| `csb_org_crossorg` | G: Cross-Org Discovery | 20 | Interface implementations and authoritative repo identification across orgs |
+| `csb_org_onboarding` | E: Onboarding & Comprehension | 28 | API consumption mapping, end-to-end flow, architecture maps |
+| `csb_org_migration` | C: Framework Migration | 26 | API migrations, breaking changes across repos |
+| `csb_org_security` | B: Vulnerability Remediation | 24 | CVE mapping, missing auth middleware across repos |
+| `csb_org_crossrepo_tracing` | A: Dependency Tracing | 22 | Cross-repo dependency chains, blast radius, symbol resolution |
 | `csb_org_domain` | H: Domain Lineage | 20 | Config propagation, architecture patterns, domain analysis |
-| `csb_org_org` | I: Organizational Context | 20 | Agentic discovery, org-wide coding correctness |
-| `csb_org_platform` | J: Platform Knowledge | 20 | Service template discovery and tribal knowledge |
-| `csb_org_crossrepo` | K: Cross-Repo Discovery | 20 | Cross-repo search, dependency discovery, impact analysis |
+| `csb_org_incident` | D: Incident Debugging | 20 | Error-to-code-path tracing across microservices |
+| `csb_org_compliance` | F: Compliance | 18 | Standards adherence, audit, and provenance workflows |
+| `csb_org_platform` | J: Platform Knowledge | 18 | Service template discovery and tribal knowledge |
+| `csb_org_crossorg` | G: Cross-Org Discovery | 15 | Interface implementations and authoritative repo identification across orgs |
+| `csb_org_org` | I: Organizational Context | 15 | Agentic discovery, org-wide coding correctness |
+| `csb_org_crossrepo` | K: Cross-Repo Discovery | 14 | Cross-repo search, dependency discovery, impact analysis |
 | **Total** | | **220** | |
 
-**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 Org across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
+**Combined canonical benchmark: 370 tasks** (150 SDLC across 9 suites + 220 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform 20-task sizing. An additional 28 backup tasks are archived in `benchmarks/backups/`.
 
 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
 
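The added sentence about DOE-driven sizing names Neyman-optimal allocation but does not show the calculation. As a rough illustration only, the sketch below applies the textbook Neyman rule, which gives each stratum a share proportional to its population size times its outcome standard deviation. The function name, the stratum labels, and all input numbers are hypothetical; the diff does not state which sizes or variances were actually used, so this does not reproduce the suite counts in the tables above.

```python
from math import floor


def neyman_allocation(total, sizes, stdevs):
    """Split `total` samples across strata in proportion to N_h * S_h.

    sizes[h]  : population size N_h of stratum h (hypothetical input)
    stdevs[h] : outcome standard deviation S_h of stratum h (hypothetical input)
    """
    weights = {h: sizes[h] * stdevs[h] for h in sizes}
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one stratum needs positive size and spread")

    # Exact (fractional) Neyman shares.
    exact = {h: total * w / weight_sum for h, w in weights.items()}

    # Largest-remainder rounding so the integer allocations sum to `total`.
    alloc = {h: floor(x) for h, x in exact.items()}
    leftover = total - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


# Made-up strata: equal candidate pools, different outcome variability.
example = neyman_allocation(
    total=50,
    sizes={"suite_a": 40, "suite_b": 40, "suite_c": 40},
    stdevs={"suite_a": 0.5, "suite_b": 0.3, "suite_c": 0.2},
)
```

With these made-up inputs the noisier stratum receives more tasks, which is the intuition behind moving away from uniform sizing.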
@@ -113,7 +113,7 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
 All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
 
 - **SDLC suites** (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
-- **Org suites** (`csb_org_*`): `baseline-local-artifact` + `mcp-remote-artifact`
+- **Org suites** (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct` (some legacy runs used `baseline-local-artifact` + `mcp-remote-artifact`)
 
 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
 
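The config names in this hunk can be folded onto the two paper-level configurations with a small lookup. The sketch below is one possible way to do that, not the repository's analysis code: the function name is hypothetical, and the mapping of the legacy directory names `sourcegraph_full` and `artifact_full` to MCP-Full is an assumption the diff does not confirm.

```python
# Current and legacy-artifact run config names, as listed in the README text.
PAPER_CONFIG = {
    "baseline-local-direct": "Baseline",
    "mcp-remote-direct": "MCP-Full",
    "baseline-local-artifact": "Baseline",
    "mcp-remote-artifact": "MCP-Full",
    # Legacy run directory names. The two *_full entries are assumed to be
    # MCP-Full runs; the source only says they are "handled by analysis scripts".
    "baseline": "Baseline",
    "sourcegraph_full": "MCP-Full",
    "artifact_full": "MCP-Full",
}


def paper_config(run_name):
    """Return the paper-level configuration for a run config or directory name."""
    try:
        return PAPER_CONFIG[run_name]
    except KeyError:
        raise ValueError(f"unrecognized run config name: {run_name!r}") from None
```

Raising on unknown names, rather than guessing, keeps a mistyped or new config from being silently counted under the wrong arm.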
@@ -132,27 +132,27 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con
 
 ```
 benchmarks/                   # Task definitions organized by SDLC phase + Org
-  csb_sdlc_feature/           # Feature Implementation (20 tasks)
-  csb_sdlc_refactor/          # Cross-File Refactoring (20 tasks)
-  csb_sdlc_debug/             # Debugging & Investigation (20 tasks)
-  csb_sdlc_design/            # Architecture & Design (20 tasks)
-  csb_sdlc_document/          # Documentation (20 tasks)
-  csb_sdlc_fix/               # Bug Repair (20 tasks)
-  csb_sdlc_secure/            # Security & Compliance (20 tasks)
-  csb_sdlc_test/              # Testing & QA (20 tasks)
-  csb_sdlc_understand/        # Requirements & Discovery (20 tasks)
+  csb_sdlc_fix/               # Bug Repair (26 tasks)
+  csb_sdlc_feature/           # Feature Implementation (23 tasks)
+  csb_sdlc_debug/             # Debugging & Investigation (18 tasks)
+  csb_sdlc_test/              # Testing & QA (18 tasks)
+  csb_sdlc_refactor/          # Cross-File Refactoring (16 tasks)
+  csb_sdlc_design/            # Architecture & Design (14 tasks)
+  csb_sdlc_document/          # Documentation (13 tasks)
+  csb_sdlc_secure/            # Security & Compliance (12 tasks)
+  csb_sdlc_understand/        # Requirements & Discovery (10 tasks)
   backups/                    # Archived backup tasks (28 total)
-  csb_org_compliance/         # Org: compliance & audit (20 tasks)
-  csb_org_crossorg/           # Org: cross-org discovery (20 tasks)
-  csb_org_crossrepo/          # Org: cross-repo discovery (20 tasks)
-  csb_org_crossrepo_tracing/  # Org: dependency tracing (20 tasks)
+  csb_org_onboarding/         # Org: onboarding (28 tasks)
+  csb_org_migration/          # Org: framework migration (26 tasks)
+  csb_org_security/           # Org: vulnerability remediation (24 tasks)
+  csb_org_crossrepo_tracing/  # Org: dependency tracing (22 tasks)
   csb_org_domain/             # Org: domain lineage (20 tasks)
   csb_org_incident/           # Org: incident debugging (20 tasks)
-  csb_org_migration/          # Org: framework migration (20 tasks)
-  csb_org_onboarding/         # Org: onboarding (20 tasks)
-  csb_org_org/                # Org: org context (20 tasks)
-  csb_org_platform/           # Org: platform knowledge (20 tasks)
-  csb_org_security/           # Org: vulnerability remediation (20 tasks)
+  csb_org_compliance/         # Org: compliance & audit (18 tasks)
+  csb_org_platform/           # Org: platform knowledge (18 tasks)
+  csb_org_crossorg/           # Org: cross-org discovery (15 tasks)
+  csb_org_org/                # Org: org context (15 tasks)
+  csb_org_crossrepo/          # Org: cross-repo discovery (14 tasks)
 configs/                      # Run configs and task selection
   _common.sh                  # Shared infra: token refresh, parallel execution, multi-account
   sdlc_suite_2config.sh       # Generic SDLC runner (used by phase wrappers below)
@@ -166,8 +166,7 @@ configs/ # Run configs and task selection
   test_2config.sh                 # Phase wrapper: Test (20 tasks)
   run_selected_tasks.sh           # Unified runner for all tasks
   validate_one_per_benchmark.sh   # Pre-flight smoke (1 task per suite)
-  selected_benchmark_tasks.json   # Canonical SDLC task selection with metadata
-  selected_mcp_unique_tasks.json  # Org task selection with metadata
+  selected_benchmark_tasks.json   # Canonical task selection: 370 tasks (150 SDLC + 220 Org)
   use_case_registry.json          # 100 GTM use cases (Org task source)
   archive/                        # Pre-SDLC migration scripts (preserved for history)
 scripts/                          # Metrics extraction, evaluation, and operational tooling
@@ -285,10 +284,10 @@ This section assumes Harbor is already installed and configured. If not, start w
 
 ### SDLC Tasks
 
-The unified runner executes all 180 SDLC tasks across the 2-config matrix:
+The unified runner executes all 370 canonical tasks across the 2-config matrix:
 
 ```bash
-# Run all 180 SDLC tasks across 2 configs
+# Run all 370 tasks across 2 configs
 bash configs/run_selected_tasks.sh
 
 # Run only the baseline config
@@ -304,30 +303,27 @@ bash configs/run_selected_tasks.sh --dry-run
 Per-phase runners are also available:
 
 ```bash
-bash configs/fix_2config.sh         # 20 Bug Repair tasks
-bash configs/feature_2config.sh     # 20 Feature Implementation tasks
-bash configs/refactor_2config.sh    # 20 Cross-File Refactoring tasks
-bash configs/understand_2config.sh  # 20 Requirements & Discovery tasks
-bash configs/design_2config.sh      # 20 Architecture & Design tasks
-bash configs/debug_2config.sh       # 20 Debugging & Investigation tasks
-bash configs/secure_2config.sh      # 20 Security & Compliance tasks
-bash configs/test_2config.sh        # 20 Testing & QA tasks
-bash configs/document_2config.sh    # 20 Documentation tasks
+bash configs/fix_2config.sh         # 26 Bug Repair tasks
+bash configs/feature_2config.sh     # 23 Feature Implementation tasks
+bash configs/debug_2config.sh       # 18 Debugging & Investigation tasks
+bash configs/test_2config.sh        # 18 Testing & QA tasks
+bash configs/refactor_2config.sh    # 16 Cross-File Refactoring tasks
+bash configs/design_2config.sh      # 14 Architecture & Design tasks
+bash configs/document_2config.sh    # 13 Documentation tasks
+bash configs/secure_2config.sh      # 12 Security & Compliance tasks
+bash configs/understand_2config.sh  # 10 Requirements & Discovery tasks
 ```
 
-### CodeScaleBench-Org Tasks
+### Filtering by Suite
 
-Org tasks use a separate selection file:
+All tasks (SDLC and Org) are in the unified `selected_benchmark_tasks.json`. Filter by suite with the `--benchmark` flag:
 
 ```bash
-# Run all Org tasks across 2 configs
-bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json
+# Run only Org security tasks
+bash configs/run_selected_tasks.sh --benchmark csb_org_security
 
-# Filter by use-case category
-bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --benchmark csb_org_security
-
-# Dry run
-bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --dry-run
+# Run only SDLC fix tasks
+bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix
 ```
 
 All runners support `--baseline-only`, `--full-only`, `--task TASK_ID`, and `--parallel N` flags.
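For scripted invocations, the runner flags named in this hunk can be assembled programmatically. The helper below is a hypothetical sketch: only the script path and the flag names (`--benchmark`, `--task`, `--parallel`, `--baseline-only`, `--full-only`, `--dry-run`) come from the README text. That the flags combine freely, and that `--baseline-only` and `--full-only` are mutually exclusive, are assumptions.

```python
import shlex


def build_runner_command(benchmark=None, task=None, parallel=None,
                         baseline_only=False, full_only=False, dry_run=False):
    """Build the argv list for the unified runner (hypothetical helper)."""
    if baseline_only and full_only:
        # Assumption: selecting both single-config modes at once is meaningless.
        raise ValueError("choose at most one of baseline_only / full_only")

    cmd = ["bash", "configs/run_selected_tasks.sh"]
    if benchmark is not None:
        cmd += ["--benchmark", benchmark]
    if task is not None:
        cmd += ["--task", task]
    if parallel is not None:
        cmd += ["--parallel", str(parallel)]
    if baseline_only:
        cmd.append("--baseline-only")
    if full_only:
        cmd.append("--full-only")
    if dry_run:
        cmd.append("--dry-run")
    return cmd


# Render a shell-safe command line for one Org suite as a dry run.
command_line = shlex.join(
    build_runner_command(benchmark="csb_org_security", parallel=4, dry_run=True)
)
```

Returning a list rather than a string lets the same value be passed directly to `subprocess.run` without shell quoting concerns; `shlex.join` is only used for display.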