Commit 72a33c0

Author: LoCoBench Bot (committed)

Add run curation guardrails, config matrix, and docs consistency checks

1 parent 802dbc6

20 files changed: +3006 −42 lines
Lines changed: 22 additions & 0 deletions (new workflow file)

```yaml
name: Docs Consistency

on:
  pull_request:
  push:
    branches:
      - main

jobs:
  docs-consistency:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Validate docs references
        run: python3 scripts/docs_consistency_check.py
```
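The contents of `scripts/docs_consistency_check.py` are not shown in this commit view. As a rough, hypothetical sketch of what such a check might do, the snippet below scans Markdown docs for `scripts/*` and `configs/*` path references and verifies that each referenced file exists; all function names and the regex here are assumptions, not the script's actual implementation.

```python
import re
import sys
from pathlib import Path

# Hypothetical sketch: verify that every scripts/* or configs/* path
# mentioned in the Markdown docs actually exists in the repository.
REF_PATTERN = re.compile(r"\b(?:scripts|configs)/[\w./-]+\.(?:py|sh|json)\b")


def find_references(doc_text: str) -> set[str]:
    """Collect repo-relative file paths referenced in a document."""
    return set(REF_PATTERN.findall(doc_text))


def check_docs(repo_root: Path, doc_names: list[str]) -> list[str]:
    """Return 'doc: missing-path' strings for references to absent files."""
    errors = []
    for name in doc_names:
        doc = repo_root / name
        if not doc.exists():
            continue
        for ref in sorted(find_references(doc.read_text())):
            if not (repo_root / ref).exists():
                errors.append(f"{name}: {ref}")
    return errors


if __name__ == "__main__":
    problems = check_docs(Path("."), ["README.md", "AGENTS.md", "CLAUDE.md"])
    for p in problems:
        print("missing:", p)
    sys.exit(1 if problems else 0)
```

A check of this shape would catch exactly the class of drift this commit fixes, such as docs still pointing at a renamed `generate_report.py`.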

AGENTS.md

Lines changed: 15 additions & 15 deletions

````diff
@@ -16,27 +16,27 @@ Benchmark tasks are executed via **Harbor** (Docker container-based runner) with

 ```bash
 # Sequential (default)
-./configs/swebenchpro_3config.sh
+./configs/swebenchpro_2config.sh

 # Parallel with auto-detected concurrency
-./configs/swebenchpro_3config.sh --parallel
+./configs/swebenchpro_2config.sh --parallel

 # Parallel with explicit job count
-./configs/swebenchpro_3config.sh --parallel 4
+./configs/swebenchpro_2config.sh --parallel 4
 ```

 All 11 benchmark config scripts accept the `--parallel` flag:
-- `swebenchpro_3config.sh` — SWE-bench Pro (36 tasks)
-- `pytorch_3config.sh` — PyTorch (12 tasks)
-- `locobench_3config.sh` — LoCoBench (25 tasks)
-- `repoqa_3config.sh` — RepoQA (10 tasks)
-- `k8s_docs_3config.sh` — Kubernetes Docs (5 tasks)
-- `crossrepo_3config.sh` — Cross-Repo (4-5 tasks)
-- `largerepo_3config.sh` — Large Repo (4 tasks)
-- `tac_3config.sh` — TAC (8 tasks)
-- `dibench_3config.sh` — DIBench (8 tasks)
-- `sweperf_3config.sh` — SWE-Perf (3 tasks)
-- `linuxflbench_3config.sh` — LinuxFLBench (5 tasks)
+- `swebenchpro_2config.sh` — SWE-bench Pro (36 tasks)
+- `pytorch_2config.sh` — PyTorch (12 tasks)
+- `locobench_2config.sh` — LoCoBench (25 tasks)
+- `repoqa_2config.sh` — RepoQA (10 tasks)
+- `k8s_docs_2config.sh` — Kubernetes Docs (5 tasks)
+- `crossrepo_2config.sh` — Cross-Repo (4-5 tasks)
+- `largerepo_2config.sh` — Large Repo (4 tasks)
+- `tac_2config.sh` — TAC (8 tasks)
+- `dibench_2config.sh` — DIBench (8 tasks)
+- `sweperf_2config.sh` — SWE-Perf (3 tasks)
+- `linuxflbench_2config.sh` — LinuxFLBench (5 tasks)

 ### Config Scripts Structure
````
````diff
@@ -230,7 +230,7 @@ After all runs complete:

 ```bash
 python3 scripts/generate_manifest.py     # Regenerate MANIFEST.json
-python3 scripts/generate_report.py       # Aggregate results into report
+python3 scripts/generate_eval_report.py  # Aggregate results into report
 ```

 The MANIFEST tracks all runs, task counts, pass/fail rates, and mean rewards.
````
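The MANIFEST aggregation the docs describe (per-suite task counts, pass/fail rates, mean rewards) can be sketched as follows. This is an illustrative reconstruction only; `RunRecord` and its field names are assumptions, since `scripts/generate_manifest.py` itself is not part of this diff.

```python
from dataclasses import dataclass


# Hypothetical record shape; the real manifest generator is not shown here.
@dataclass
class RunRecord:
    suite: str
    task_id: str
    reward: float  # assume 1.0 = pass, 0.0 = fail, fractional = partial credit


def summarize(runs: list[RunRecord]) -> dict[str, dict[str, float]]:
    """Group runs by suite and compute task count, pass rate, mean reward."""
    by_suite: dict[str, list[float]] = {}
    for run in runs:
        by_suite.setdefault(run.suite, []).append(run.reward)
    return {
        suite: {
            "tasks": len(rewards),
            "pass_rate": sum(1 for r in rewards if r >= 1.0) / len(rewards),
            "mean_reward": sum(rewards) / len(rewards),
        }
        for suite, rewards in by_suite.items()
    }
```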

CLAUDE.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -107,7 +107,7 @@ The script generates an OAuth URL — open it in your local browser, log in, pas
 python3 scripts/generate_manifest.py

 # Generate evaluation report
-python3 scripts/generate_report.py
+python3 scripts/generate_eval_report.py

 # Select benchmark tasks
 python3 scripts/select_benchmark_tasks.py
@@ -163,7 +163,7 @@ MAINTENANCE
 |-------|--------|---------|
 | `/compare-configs` | `scripts/compare_configs.py` | Show divergent tasks across baseline/SG_full, "MCP helps" vs "MCP hurts". Now includes optional MCP-conditioned analysis. |
 | `/cost-report` | `scripts/cost_report.py` | Token usage and estimated cost by suite/config, most expensive tasks |
-| `/generate-report` | `scripts/generate_report.py` | Aggregate CCB evaluation report from completed runs |
+| `/generate-report` | `scripts/generate_eval_report.py` | Aggregate CCB evaluation report from completed runs |
 | `/evaluate-traces` | `scripts/audit_traces.py` | Comprehensive trace evaluation: data integrity, output quality, efficiency analysis. Includes zero-MCP vs used-MCP classification. |
 | `/mcp-audit` | `scripts/mcp_audit.py` | MCP usage patterns: used vs zero-MCP, intensity buckets, reward/time deltas conditioned on actual MCP adoption |
```
README.md

Lines changed: 25 additions & 25 deletions

````diff
@@ -60,19 +60,19 @@ benchmarks/ # Task definitions organized by benchmark suite
     ccb_tac/                  # TheAgentCompany tasks (8 tasks)
 configs/                      # 3-config comparison shell runners + task selection
     _common.sh                # Shared infra: token refresh, parallel execution, multi-account
-    codereview_3config.sh     # Per-suite runner: CodeReview (3 tasks)
-    crossrepo_3config.sh      # Per-suite runner: CrossRepo (5 tasks)
-    dibench_3config.sh        # Per-suite runner: DIBench (8 tasks)
-    k8s_docs_3config.sh       # Per-suite runner: K8s Docs (5 tasks)
-    largerepo_3config.sh      # Per-suite runner: Large Repo (4 tasks)
-    linuxflbench_3config.sh   # Per-suite runner: LinuxFLBench (5 tasks)
-    locobench_3config.sh      # Per-suite runner: LoCoBench (25 tasks)
-    pytorch_3config.sh        # Per-suite runner: PyTorch (12 tasks)
-    repoqa_3config.sh         # Per-suite runner: RepoQA (10 tasks)
+    codereview_2config.sh     # Per-suite runner: CodeReview (3 tasks)
+    crossrepo_2config.sh      # Per-suite runner: CrossRepo (5 tasks)
+    dibench_2config.sh        # Per-suite runner: DIBench (8 tasks)
+    k8s_docs_2config.sh       # Per-suite runner: K8s Docs (5 tasks)
+    largerepo_2config.sh      # Per-suite runner: Large Repo (4 tasks)
+    linuxflbench_2config.sh   # Per-suite runner: LinuxFLBench (5 tasks)
+    locobench_2config.sh      # Per-suite runner: LoCoBench (25 tasks)
+    pytorch_2config.sh        # Per-suite runner: PyTorch (12 tasks)
+    dependeval_2config.sh     # Per-suite runner: DependEval (32 tasks)
     run_selected_tasks.sh     # Unified runner for all tasks
-    swebenchpro_3config.sh    # Per-suite runner: SWE-Bench Pro (36 tasks)
-    sweperf_3config.sh        # Per-suite runner: SWE-Perf (3 tasks)
-    tac_3config.sh            # Per-suite runner: TheAgentCompany (8 tasks)
+    swebenchpro_2config.sh    # Per-suite runner: SWE-Bench Pro (36 tasks)
+    sweperf_2config.sh        # Per-suite runner: SWE-Perf (3 tasks)
+    tac_2config.sh            # Per-suite runner: TheAgentCompany (8 tasks)
     selected_benchmark_tasks.json # Canonical task selection with metadata
 scripts/                      # Metrics extraction, evaluation, and operational tooling
     ccb_metrics/              # Python package: models, extractors, discovery, judge context
@@ -148,25 +148,25 @@ bash configs/run_selected_tasks.sh --dry-run
 Per-suite runners are also available for individual benchmarks:

 ```bash
-bash configs/swebenchpro_3config.sh   # 36 SWE-Bench Pro tasks
-bash configs/locobench_3config.sh     # 25 LoCoBench tasks
-bash configs/pytorch_3config.sh       # 12 PyTorch tasks
-bash configs/repoqa_3config.sh        # 10 RepoQA tasks
-bash configs/tac_3config.sh           # 8 TheAgentCompany tasks
-bash configs/dibench_3config.sh       # 8 DIBench tasks
-bash configs/crossrepo_3config.sh     # 5 CrossRepo tasks
-bash configs/k8s_docs_3config.sh      # 5 K8s Docs tasks
-bash configs/linuxflbench_3config.sh  # 5 LinuxFLBench tasks (see note below)
-bash configs/largerepo_3config.sh     # 4 Large Repo tasks
-bash configs/sweperf_3config.sh       # 3 SWE-Perf tasks
-bash configs/codereview_3config.sh    # 3 CodeReview tasks
+bash configs/swebenchpro_2config.sh   # 36 SWE-Bench Pro tasks
+bash configs/locobench_2config.sh     # 25 LoCoBench tasks
+bash configs/pytorch_2config.sh       # 12 PyTorch tasks
+bash configs/dependeval_2config.sh    # 32 DependEval tasks
+bash configs/tac_2config.sh           # 8 TheAgentCompany tasks
+bash configs/dibench_2config.sh       # 8 DIBench tasks
+bash configs/crossrepo_2config.sh     # 5 CrossRepo tasks
+bash configs/k8s_docs_2config.sh      # 5 K8s Docs tasks
+bash configs/linuxflbench_2config.sh  # 5 LinuxFLBench tasks (see note below)
+bash configs/largerepo_2config.sh     # 4 Large Repo tasks
+bash configs/sweperf_2config.sh       # 3 SWE-Perf tasks
+bash configs/codereview_2config.sh    # 3 CodeReview tasks
 ```

 All runners support `--baseline-only` and `--full-only` flags.

 **LinuxFLBench note:** Docker image build is slow (~10 min) due to Linux kernel partial clone (~2GB). Pre-build images before running to save time.

-**DependEval note:** DependEval tasks use `--path` mode with local task directories. There is no unified `dependeval_3config.sh` yet; tasks are tracked via `configs/dependeval_selected_instances.json`.
+**DependEval note:** DependEval runs use local task directories and are handled by `configs/dependeval_2config.sh`.

 Requires [Harbor](https://github.com/laude-institute/harbor/tree/main) installed and configured with a Claude API key.
````
configs/eval_matrix.json

Lines changed: 72 additions & 0 deletions (new file)

```json
{
  "description": "Canonical benchmark config matrix and extension registry for CodeContextBench.",
  "official_default_configs": [
    "baseline",
    "sourcegraph_full"
  ],
  "supported_configs": [
    "baseline",
    "sourcegraph_base",
    "sourcegraph_full",
    "sourcegraph_isolated",
    "github_base",
    "github_full"
  ],
  "config_definitions": {
    "baseline": {
      "baseline_mcp_type": "none",
      "mcp_enabled": false,
      "provider": "none",
      "track_in_official": true,
      "status": "active"
    },
    "sourcegraph_base": {
      "baseline_mcp_type": "sourcegraph_base",
      "mcp_enabled": true,
      "provider": "sourcegraph",
      "track_in_official": true,
      "status": "legacy_or_targeted"
    },
    "sourcegraph_full": {
      "baseline_mcp_type": "sourcegraph_full",
      "mcp_enabled": true,
      "provider": "sourcegraph",
      "track_in_official": true,
      "status": "active"
    },
    "sourcegraph_isolated": {
      "baseline_mcp_type": "sourcegraph_isolated",
      "mcp_enabled": true,
      "provider": "sourcegraph",
      "track_in_official": true,
      "status": "experimental"
    },
    "sourcegraph_only": {
      "baseline_mcp_type": "sourcegraph_only",
      "mcp_enabled": true,
      "provider": "sourcegraph",
      "track_in_official": true,
      "status": "experimental"
    },
    "github_base": {
      "baseline_mcp_type": "github_base",
      "mcp_enabled": true,
      "provider": "github",
      "track_in_official": false,
      "status": "experimental_scaffold"
    },
    "github_full": {
      "baseline_mcp_type": "github_full",
      "mcp_enabled": true,
      "provider": "github",
      "track_in_official": false,
      "status": "experimental_scaffold"
    }
  },
  "provider_templates": {
    "github": {
      "example_config_name": "github_full",
      "notes": "Reserved template for future GitHub MCP integration."
    }
  }
}
```
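The matrix carries a few implicit invariants: every official default config must be a supported config, and every supported config needs a definition. The snippet below is a minimal sketch of how a guardrail script could check them; the `validate` function and the abridged inline copy of the matrix are illustrations, not code from this commit.

```python
import json

# Sketch (not part of this commit): check the implicit invariants of
# configs/eval_matrix.json. EVAL_MATRIX is an abridged inline copy.
EVAL_MATRIX = json.loads("""
{
  "official_default_configs": ["baseline", "sourcegraph_full"],
  "supported_configs": ["baseline", "sourcegraph_base", "sourcegraph_full",
                        "sourcegraph_isolated", "github_base", "github_full"],
  "config_definitions": {
    "baseline":             {"mcp_enabled": false, "provider": "none"},
    "sourcegraph_base":     {"mcp_enabled": true,  "provider": "sourcegraph"},
    "sourcegraph_full":     {"mcp_enabled": true,  "provider": "sourcegraph"},
    "sourcegraph_isolated": {"mcp_enabled": true,  "provider": "sourcegraph"},
    "sourcegraph_only":     {"mcp_enabled": true,  "provider": "sourcegraph"},
    "github_base":          {"mcp_enabled": true,  "provider": "github"},
    "github_full":          {"mcp_enabled": true,  "provider": "github"}
  }
}
""")


def validate(matrix: dict) -> list[str]:
    """Return human-readable violations of the matrix's invariants."""
    errors = []
    supported = set(matrix["supported_configs"])
    defined = set(matrix["config_definitions"])
    for cfg in matrix["official_default_configs"]:
        if cfg not in supported:
            errors.append(f"official config {cfg!r} is not in supported_configs")
    for cfg in sorted(supported - defined):
        errors.append(f"supported config {cfg!r} has no definition")
    return errors
```

Note that a definition without a `supported_configs` entry (here `sourcegraph_only`) passes this check, which matches the file's apparent intent of keeping experimental definitions registered without declaring them supported.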
