Commit e298e04

LoCoBench Bot and claude committed
feat: [US-014] - Update README and push benchmarks repo to GitHub
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 2a7b7a4 commit e298e04

File tree: 1 file changed (+80, −41 lines)

README.md

Lines changed: 80 additions & 41 deletions
# CodeContextBench

Benchmark suite for evaluating how AI coding agents leverage external context tools (MCP servers) on software engineering tasks across the SDLC. Developed as the reproducibility artifact for the paper *"Evaluating the Impact of Model Context Protocol on AI Coding Agent Performance Across the Software Development Lifecycle."*

This repository contains **benchmark task definitions**, **evaluation configs**, and a **metrics extraction pipeline**. Tasks are executed via the [Harbor](https://github.com/mainmatter/harbor) runner with the Claude Code agent harness.
---

## Benchmark Suites

| Suite | Tasks | Languages | Evaluation Method | SDLC Phase |
|-------|------:|-----------|-------------------|------------|
| `kubernetes_docs` | 5 | Go | LLM judge + test scripts | Documentation |
| `big_code_mcp` | 4 | Go, Rust, C++, TypeScript | Test suite | Code navigation |
| `locobench_agent` | 50 | Multi-language | Semantic similarity | Long-context reasoning |
| `swebench_pro` | 50 | Multi-language | Test suite | Bug fixing |
| `github_mined` | 25 | Python | Test suite | Feature implementation |

Additional benchmark suites (`sweperf`, `tac_mcp_value`, `dibench`, `repoqa`, etc.) are included but are still in early stages of development.
20+
21+
---
22+
23+
## 3-Config Evaluation Matrix
24+
25+
All benchmarks are evaluated across three agent configurations that vary the external context tools available via MCP:
26+
27+
| Paper Config Name | `BASELINE_MCP_TYPE` | MCP Tools Available |
28+
|-------------------|---------------------|---------------------|
29+
| Baseline | `none` | None (agent uses only built-in tools) |
30+
| MCP-NoDeepSearch | `sourcegraph_no_deepsearch` | `sg_keyword_search`, `sg_read_file`, `sg_find_file`, `sg_nls_search`, `sg_search_suggestions`, `sg_get_context` (6 tools) |
31+
| MCP-Full | `sourcegraph_hybrid` | All MCP-NoDeepSearch tools + `sg_deepsearch`, `sg_deepsearch_read` (8 tools) |
32+
33+
See [docs/CONFIGS.md](docs/CONFIGS.md) for the full tool-by-tool breakdown.
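The matrix above can be captured programmatically. The following is an illustrative sketch (not part of the repo): a hypothetical helper that maps each `BASELINE_MCP_TYPE` value to the MCP tool set it should expose, which a harness wrapper could use to sanity-check a run's configuration. The tool names come from the table; the helper itself is an assumption.

```python
# Hypothetical helper, not shipped in the repo: map BASELINE_MCP_TYPE
# values to the MCP tool set each config is expected to expose.
NO_DEEPSEARCH_TOOLS = [
    "sg_keyword_search", "sg_read_file", "sg_find_file",
    "sg_nls_search", "sg_search_suggestions", "sg_get_context",
]
DEEPSEARCH_TOOLS = ["sg_deepsearch", "sg_deepsearch_read"]

CONFIG_TOOLS = {
    "none": [],                                             # Baseline
    "sourcegraph_no_deepsearch": NO_DEEPSEARCH_TOOLS,       # MCP-NoDeepSearch
    "sourcegraph_hybrid": NO_DEEPSEARCH_TOOLS + DEEPSEARCH_TOOLS,  # MCP-Full
}

def expected_tools(baseline_mcp_type: str) -> list[str]:
    """Return the MCP tools a run under this config should have available."""
    try:
        return CONFIG_TOOLS[baseline_mcp_type]
    except KeyError:
        raise ValueError(f"unknown BASELINE_MCP_TYPE: {baseline_mcp_type!r}")

print(len(expected_tools("sourcegraph_no_deepsearch")))  # 6
print(len(expected_tools("sourcegraph_hybrid")))         # 8
```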
---

## Repository Structure

```
benchmarks/                  # Task definitions organized by benchmark suite
  kubernetes_docs/           # K8s package documentation generation (5 tasks)
  big_code_mcp/              # Large-repo code navigation (4 tasks)
  locobench_agent/           # LoCoBench long-context agent tasks (50 tasks)
  swebench_pro/              # SWE-Bench Pro bug-fixing tasks (731 available, 50 selected)
  github_mined/              # GitHub-mined SWE tasks (25 tasks)
  ...                        # Additional suites in development
ralph/                       # Agent working directory
  configs/                   # 3-config comparison YAML + shell runners per benchmark
  scripts/                   # Metrics extraction and evaluation pipeline
    ccb_metrics/             # Python package: models, extractors, discovery, judge context
    generate_eval_report.py  # CLI: deterministic evaluation report generator
docs/                        # Configuration documentation and diagnosis reports
schemas/                     # JSON schemas for MANIFEST.json, task.toml, etc.
swe_bench_configs/           # SWE-Bench integration configuration
```
Each benchmark directory contains:
- `MANIFEST.json` — metadata, task IDs, evaluation config
- Per-task subdirectories with `instruction.md`, `task.toml`, tests, and ground truth (or `solution/`)
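As a sketch of how a manifest might be consumed, the snippet below validates that a loaded `MANIFEST.json` carries the three sections listed above. The exact field names (`metadata`, `task_ids`, `evaluation`) are assumptions for illustration; the authoritative shapes live in `schemas/`.

```python
import json

# Assumed field names for illustration only; see schemas/ for the real schema.
REQUIRED_KEYS = {"metadata", "task_ids", "evaluation"}

def check_manifest(text: str) -> list[str]:
    """Parse a MANIFEST.json payload and return its task IDs,
    raising if any expected top-level section is missing."""
    manifest = json.loads(text)
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        raise ValueError(f"manifest missing keys: {sorted(missing)}")
    return manifest["task_ids"]

# Hypothetical minimal manifest for a documentation suite.
sample = json.dumps({
    "metadata": {"suite": "kubernetes_docs"},
    "task_ids": ["k8s-docs-001"],
    "evaluation": {"method": "llm_judge"},
})
print(check_manifest(sample))  # ['k8s-docs-001']
```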
---

## Metrics Extraction Pipeline

The `ralph/scripts/` directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output:

```bash
cd ralph/

# Generate evaluation report from Harbor runs
python3 scripts/generate_eval_report.py \
  --runs-dir /path/to/runs/official/ \
  --output-dir ./eval_reports/

# Generate LLM judge context files
python3 -m scripts.ccb_metrics.judge_context \
  --runs-dir /path/to/runs/official/ \
  --benchmarks-dir ../benchmarks/ \
  --output-dir ./judge_contexts/
```
The report generator produces:
- `eval_report.json` — full structured report
- `REPORT.md` — markdown tables (performance, efficiency, tool utilization)
- `harness_configs.json` — exact harness configuration per run
- CSV files per table for downstream analysis

See `python3 ralph/scripts/generate_eval_report.py --help` for all options.
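For downstream analysis, `eval_report.json` can be loaded directly. The sketch below aggregates a per-config success rate; the report's internal layout here (a `runs` list with `config` and `passed` fields) is a guess for illustration — inspect a real `eval_report.json` for the actual schema emitted by `generate_eval_report.py`.

```python
# Illustrative only: the "runs"/"config"/"passed" keys are assumed, not
# the documented schema of eval_report.json.
def success_rate_by_config(report: dict) -> dict[str, float]:
    """Group per-run outcomes by config name and compute a pass rate."""
    totals: dict[str, list[int]] = {}
    for run in report["runs"]:
        ok, n = totals.setdefault(run["config"], [0, 0])
        totals[run["config"]] = [ok + int(run["passed"]), n + 1]
    return {cfg: ok / n for cfg, (ok, n) in totals.items()}

sample = {"runs": [
    {"config": "baseline", "passed": True},
    {"config": "baseline", "passed": False},
    {"config": "mcp_full", "passed": True},
]}
print(success_rate_by_config(sample))  # {'baseline': 0.5, 'mcp_full': 1.0}
```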
---

## Running with Harbor

Each benchmark has a shell runner in `ralph/configs/` that executes all tasks across the 3-config matrix:

```bash
# Run all 50 LoCoBench tasks across 3 configs
bash ralph/configs/locobench_3config.sh

# Run only the baseline config
bash ralph/configs/locobench_3config.sh --baseline-only

# Run only the MCP-Full config
bash ralph/configs/locobench_3config.sh --full-only
```

Available runners: `locobench_3config.sh`, `swebenchpro_3config.sh`, `bigcode_3config.sh`, `k8s_docs_3config.sh`.

Requires [Harbor](https://github.com/mainmatter/harbor) installed and configured with a Claude API key.
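A thin driver can enumerate those runner invocations for scripting or logging. The snippet below is a hypothetical convenience wrapper, not shipped in the repo: it builds the three command variants shown above and, by default, only prints them (a dry run) rather than executing.

```python
import subprocess

# Hypothetical driver (not part of the repo). The runner script path and
# its --baseline-only / --full-only flags come from the README above.
SELECTIVE_FLAGS = ["--baseline-only", "--full-only"]

def runner_commands(script: str) -> list[list[str]]:
    """Build the invocation variants for one benchmark's 3-config runner."""
    cmds = [["bash", script]]  # no flag: all three configs in one pass
    cmds += [["bash", script, flag] for flag in SELECTIVE_FLAGS]
    return cmds

def run_all(script: str, dry_run: bool = True) -> None:
    for cmd in runner_commands(script):
        if dry_run:
            print(" ".join(cmd))  # inspect before running for real
        else:
            subprocess.run(cmd, check=True)

run_all("ralph/configs/locobench_3config.sh")
```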
---

## License
