# CodeContextBench

Benchmark suite for evaluating how AI coding agents leverage external context tools (MCP servers) on software engineering tasks across the SDLC. Developed as the reproducibility artifact for the paper *"Evaluating the Impact of Model Context Protocol on AI Coding Agent Performance Across the Software Development Lifecycle."*

This repository contains **benchmark task definitions**, **evaluation configs**, and a **metrics extraction pipeline**. Tasks are executed via the [Harbor](https://github.com/mainmatter/harbor) runner with the Claude Code agent harness.

---

## Benchmark Suites

| Suite | Tasks | Languages | Evaluation Method | SDLC Phase |
|-------|------:|-----------|-------------------|------------|
| `kubernetes_docs` | 5 | Go | LLM judge + test scripts | Documentation |
| `big_code_mcp` | 4 | Go, Rust, C++, TypeScript | Test suite | Code navigation |
| `locobench_agent` | 50 | Multi-language | Semantic similarity | Long-context reasoning |
| `swebench_pro` | 50 | Multi-language | Test suite | Bug fixing |
| `github_mined` | 25 | Python | Test suite | Feature implementation |

Additional benchmark suites (`sweperf`, `tac_mcp_value`, `dibench`, `repoqa`, etc.) are included but are still in early stages of development.

---

## 3-Config Evaluation Matrix

All benchmarks are evaluated across three agent configurations that vary the external context tools available via MCP:

| Paper Config Name | `BASELINE_MCP_TYPE` | MCP Tools Available |
|-------------------|---------------------|---------------------|
| Baseline | `none` | None (agent uses only built-in tools) |
| MCP-NoDeepSearch | `sourcegraph_no_deepsearch` | `sg_keyword_search`, `sg_read_file`, `sg_find_file`, `sg_nls_search`, `sg_search_suggestions`, `sg_get_context` (6 tools) |
| MCP-Full | `sourcegraph_hybrid` | All MCP-NoDeepSearch tools plus `sg_deepsearch` and `sg_deepsearch_read` (8 tools) |

See [docs/CONFIGS.md](docs/CONFIGS.md) for the full tool-by-tool breakdown.
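
The matrix above can be sketched in Python. The tool names come straight from the table; the dictionary layout itself is purely illustrative and is not a format the harness consumes:

```python
# Illustrative sketch of the 3-config matrix. Tool names are taken from the
# table above; this dictionary is NOT a file format the harness reads.
NO_DEEPSEARCH_TOOLS = [
    "sg_keyword_search", "sg_read_file", "sg_find_file",
    "sg_nls_search", "sg_search_suggestions", "sg_get_context",
]

# Keyed by BASELINE_MCP_TYPE value.
CONFIGS = {
    "none": [],  # Baseline: built-in agent tools only
    "sourcegraph_no_deepsearch": NO_DEEPSEARCH_TOOLS,
    "sourcegraph_hybrid": NO_DEEPSEARCH_TOOLS + ["sg_deepsearch", "sg_deepsearch_read"],
}

for mcp_type, tools in CONFIGS.items():
    print(f"{mcp_type}: {len(tools)} MCP tools")
```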

---

## Repository Structure

```
benchmarks/               # Task definitions organized by benchmark suite
  kubernetes_docs/        # K8s package documentation generation (5 tasks)
  big_code_mcp/           # Large-repo code navigation (4 tasks)
  locobench_agent/        # LoCoBench long-context agent tasks (50 tasks)
  swebench_pro/           # SWE-Bench Pro bug-fixing tasks (731 available, 50 selected)
  github_mined/           # GitHub-mined SWE tasks (25 tasks)
  ...                     # Additional suites in development
ralph/                    # Agent working directory
  configs/                # 3-config comparison YAML + shell runners per benchmark
  scripts/                # Metrics extraction and evaluation pipeline
    ccb_metrics/          # Python package: models, extractors, discovery, judge context
    generate_eval_report.py  # CLI: deterministic evaluation report generator
docs/                     # Configuration documentation and diagnosis reports
schemas/                  # JSON schemas for MANIFEST.json, task.toml, etc.
swe_bench_configs/        # SWE-Bench integration configuration
```

Each benchmark directory contains:
- `MANIFEST.json` — metadata, task IDs, evaluation config
- Per-task subdirectories with `instruction.md`, `task.toml`, tests, and ground truth (or `solution/`)

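This layout makes suite discovery mechanical. A minimal sketch (assuming one `MANIFEST.json` per suite and an `instruction.md` in every task directory; the pipeline's actual discovery logic may differ):

```python
from pathlib import Path

def discover_suites(benchmarks_dir: Path) -> dict[str, list[str]]:
    """Map each suite (a directory holding MANIFEST.json) to its task dirs.

    Sketch only: assumes any subdirectory containing an instruction.md
    is a task, which may not match the pipeline's real discovery rules.
    """
    suites = {}
    for manifest in sorted(benchmarks_dir.glob("*/MANIFEST.json")):
        suite_dir = manifest.parent
        tasks = sorted(
            p.name for p in suite_dir.iterdir()
            if p.is_dir() and (p / "instruction.md").exists()
        )
        suites[suite_dir.name] = tasks
    return suites
```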

---

## Metrics Extraction Pipeline

The `ralph/scripts/` directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output:

```bash
cd ralph/

# Generate evaluation report from Harbor runs
python3 scripts/generate_eval_report.py \
  --runs-dir /path/to/runs/official/ \
  --output-dir ./eval_reports/

# Generate LLM judge context files
python3 -m scripts.ccb_metrics.judge_context \
  --runs-dir /path/to/runs/official/ \
  --benchmarks-dir ../benchmarks/ \
  --output-dir ./judge_contexts/
```

The report generator produces:
- `eval_report.json` — the full structured report
- `REPORT.md` — Markdown tables (performance, efficiency, tool utilization)
- `harness_configs.json` — the exact harness configuration per run
- Per-table CSV files for downstream analysis
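
To give a feel for consuming the output, here is a minimal sketch that tallies per-config results from `eval_report.json`. The field names (`configs`, `resolved`) are illustrative assumptions, not the report's documented schema:

```python
import json
from pathlib import Path

def summarize_report(report_path: Path) -> dict[str, tuple[int, int]]:
    """Return {config: (resolved_tasks, total_tasks)} from an eval report.

    Sketch only: the "configs" and "resolved" field names are assumptions,
    not the documented schema of eval_report.json.
    """
    report = json.loads(report_path.read_text())
    return {
        config: (sum(1 for t in tasks if t.get("resolved")), len(tasks))
        for config, tasks in report.get("configs", {}).items()
    }
```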

See `python3 ralph/scripts/generate_eval_report.py --help` for all options.

---

## Running with Harbor

Each benchmark has a shell runner in `ralph/configs/` that executes all tasks across the 3-config matrix:

```bash
# Run all 50 LoCoBench tasks across the 3 configs
bash ralph/configs/locobench_3config.sh

# Run only the Baseline config
bash ralph/configs/locobench_3config.sh --baseline-only

# Run only the MCP-Full config
bash ralph/configs/locobench_3config.sh --full-only
```

Available runners: `locobench_3config.sh`, `swebenchpro_3config.sh`, `bigcode_3config.sh`, `k8s_docs_3config.sh`.

Requires [Harbor](https://github.com/mainmatter/harbor) installed and configured with a Claude API key.

---

## License