
Commit b607290

sjarmak and cursoragent committed
Add deterministic control plane, repo health gate, and agent-facing instructions
- Control plane: configs/control_plane_ccb.yaml, scripts/control_plane.py, docs/CONTROL_PLANE.md for a deterministic run manifest from spec + task source
- Repo health: configs/repo_health.json, scripts/repo_health.py, docs/REPO_HEALTH.md; single gate (docs consistency, selection file, task preflight) to reduce drift and keep branches clean
- CLAUDE.md/AGENTS.md: step 0 and skill routing for "before commit/push" run repo_health; Repo health section and canonical ref to REPO_HEALTH.md
- Skill repo-health (skills/repo-health/SKILL.md): trigger on commit/push, reduce drift; maintenance.md and SKILLS.md updated
- docs_consistency_check: add REPO_HEALTH.md to default docs
- .github/workflows/repo_health.yml added (if .gitignore allows)

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 56159d5 commit b607290

File tree

13 files changed: +856 −3 lines changed


AGENTS.md

Lines changed: 8 additions & 1 deletion

@@ -20,6 +20,7 @@ per-task details.
 - `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions
 - `docs/AGENT_INTERFACE.md` - runtime I/O contract
 - `docs/EXTENSIBILITY.md` - safe suite/task/config extension
+- `docs/REPO_HEALTH.md` - health gate and branch hygiene (reduce drift)
 - `docs/LEADERBOARD.md` - ranking policy
 - `docs/SUBMISSION.md` - submission format
 - `docs/SKILLS.md` - AI agent skill system overview
@@ -28,6 +29,7 @@ per-task details.
 ## Typical Skill Routing
 Use these defaults unless there is a task-specific reason not to.

+- **Before commit or push:** `repo-health` — run `python3 scripts/repo_health.py` (or `--quick` if only docs/config changed). Do not commit/push with failing health checks.
 - Pre-run readiness: `check-infra`, `validate-tasks`
 - Launch/runs: `run-benchmark`, `run-status`, `watch-benchmarks`
 - Failure investigation: `triage-failure`, `quick-rerun`
@@ -45,6 +47,7 @@ See `docs/CONFIGS.md` for the full environment model, tool lists, and how to
 add sg_only support to new tasks.

 ## Standard Workflow
+0. **Before commit or push:** Run `python3 scripts/repo_health.py` (or `--quick`). Fix any failures so main stays clean and drift is caught early (see `docs/REPO_HEALTH.md`).
 1. Run infrastructure checks before any batch.
 2. Validate task integrity before launch (include runtime smoke for new/changed tasks).
 3. Run the benchmark config (`configs/*_2config.sh` or equivalent).
@@ -92,6 +95,7 @@ python3 scripts/generate_eval_report.py
 python3 scripts/abc_audit.py --suite <suite> # quality audit
 python3 scripts/abc_score_task.py --suite <suite> # per-task quality score
 python3 scripts/docs_consistency_check.py # documentation drift guard
+python3 scripts/repo_health.py # repo health gate (before push); --quick for fast check
 ```

 ## Script Entrypoints
@@ -126,11 +130,14 @@ python3 scripts/docs_consistency_check.py # documentation drift guard
 - `audit_traces.py` - agent trace auditing
 - `ds_audit.py` - Deep Search usage audit

+### Repo health (reduce drift, clean branches)
+- `repo_health.py` - single gate: docs consistency + selection file + task preflight (see docs/REPO_HEALTH.md)
+- `docs_consistency_check.py` - documentation drift guard
+
 ### Quality Assurance
 - `abc_audit.py` - ABC benchmark quality audit (32 criteria across 3 dimensions)
 - `abc_score_task.py` - per-task quality scoring
 - `abc_criteria.py` - ABC criteria data model
-- `docs_consistency_check.py` - documentation drift guard
 - `validate_official_integrity.py` - official run integrity checks
 - `quarantine_invalid_tasks.py` - quarantine tasks with zero MCP usage

CLAUDE.md

Lines changed: 8 additions & 1 deletion (the diff is identical to the AGENTS.md change above)

configs/control_plane_ccb.yaml

Lines changed: 26 additions & 0 deletions

# Deterministic control plane for CodeContextBench 2-config runs.
# Same file + same task source → same experiment_id and run list.
#
# Generate manifest:
#   python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --output runs/staging/<experiment_id>/manifest.json
# Dry-run:
#   python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --dry-run

experiment_name: ccb_2config
description: "CCB 157 tasks × baseline + sourcegraph_full"
run_category: staging

# Path to task list (relative to repo root). Must have .tasks[].benchmark, .tasks[].task_id, .tasks[].task_dir
task_source: configs/selected_benchmark_tasks.json

# Optional: limit to one benchmark (e.g. ccb_fix). Omit or empty string = all benchmarks.
benchmark_filter: ""

models:
  - anthropic/claude-opus-4-6

mcp_modes:
  - baseline
  - sourcegraph_full

seeds: [0]
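The spec above expands into a run matrix of task × model × mcp_mode × seed. As an illustration only (the real generator is `scripts/control_plane.py`, which this commit does not show, and the task names below are hypothetical stand-ins), the expansion could be sketched as:

```python
# Hypothetical sketch of expanding a control plane spec into a deterministic
# run list; not the actual scripts/control_plane.py implementation.
from itertools import product

spec = {
    "models": ["anthropic/claude-opus-4-6"],
    "mcp_modes": ["baseline", "sourcegraph_full"],
    "seeds": [0],
}
# Stand-in task IDs; the real list comes from configs/selected_benchmark_tasks.json.
tasks = ["ccb_fix/task_001", "ccb_design/task_002"]

def expand(spec, tasks):
    # Sort tasks first so the run order is stable across invocations.
    return [
        {"task": t, "model": m, "mcp_mode": mode, "seed": s}
        for t, m, mode, s in product(
            sorted(tasks), spec["models"], spec["mcp_modes"], spec["seeds"]
        )
    ]

runs = expand(spec, tasks)  # 2 tasks × 1 model × 2 modes × 1 seed = 4 runs
```

Because the task list is sorted and the product order is fixed, re-running the expansion with the same inputs yields the same list in the same order, which is the determinism property the spec comments promise.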

configs/repo_health.json

Lines changed: 33 additions & 0 deletions

{
  "version": "1",
  "description": "Contract for repo health: run these checks to keep the tree clean and reduce drift before commit/push.",
  "checks": {
    "docs_consistency": {
      "script": "scripts/docs_consistency_check.py",
      "required": true,
      "description": "Doc and config references exist; eval_matrix valid"
    },
    "task_preflight_static": {
      "script": "scripts/validate_tasks_preflight.py",
      "args": ["--all"],
      "required": true,
      "description": "Task definitions valid (instruction length, test.sh, no placeholders)"
    },
    "selection_file": {
      "script": null,
      "required": true,
      "description": "configs/selected_benchmark_tasks.json exists and is valid JSON"
    }
  },
  "quick_checks": [
    "docs_consistency",
    "selection_file"
  ],
  "branch_hygiene": {
    "recommendations": [
      "Run repo_health (or repo_health --quick) before push",
      "Merge working state often; keep branches short",
      "After changing docs or configs, run docs_consistency to catch drift"
    ]
  }
}
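A gate runner consuming this contract is straightforward: in `--quick` mode it runs only the names in `quick_checks`, otherwise every entry in `checks`, shelling out to each check's `script`. The sketch below is hypothetical (the real `scripts/repo_health.py` is not shown in this commit) and inlines a trimmed copy of the contract for self-containment:

```python
# Hypothetical sketch of a gate runner for the repo_health contract;
# the actual scripts/repo_health.py may differ.
import json
import subprocess
import sys

CONTRACT = json.loads("""
{
  "checks": {
    "docs_consistency": {"script": "scripts/docs_consistency_check.py", "required": true},
    "task_preflight_static": {"script": "scripts/validate_tasks_preflight.py", "args": ["--all"], "required": true},
    "selection_file": {"script": null, "required": true}
  },
  "quick_checks": ["docs_consistency", "selection_file"]
}
""")

def pick_checks(contract, quick):
    """Return the ordered list of check names for this mode."""
    return list(contract["quick_checks"]) if quick else list(contract["checks"])

def run_gate(contract, quick=False):
    """Run each selected check; exit code 0 = all required checks passed."""
    failed = []
    for name in pick_checks(contract, quick):
        spec = contract["checks"][name]
        if spec["script"] is None:
            continue  # e.g. selection_file is validated in-process, not via a script
        cmd = [sys.executable, spec["script"], *spec.get("args", [])]
        if subprocess.run(cmd).returncode != 0 and spec["required"]:
            failed.append(name)
    return 0 if not failed else 1
```

The `quick_checks` list is what makes `--quick` skip the full task sweep while keeping the same contract file as the single source of truth.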

docs/CONTROL_PLANE.md

Lines changed: 128 additions & 0 deletions

# Deterministic Control Plane

This document describes how to use the **deterministic control plane** in CodeContextBench: a single declarative spec that defines exactly which runs execute, with stable experiment/run/pair IDs and ordering. Same spec + same task source → same manifest every time.

## Rationale

- **Single source of truth**: "What to run" is defined in one place (control plane spec + task list), not scattered across CLI flags and shell logic.
- **Reproducibility**: Experiment ID, run IDs, and pair IDs are derived from invariants (config hash, task set, model, seeds), so re-runs and comparisons are stable.
- **Separation of concerns**: The control plane defines *what* to run (the task × config × seed matrix); execution (Harbor, Docker) defines *how* it runs.

## Components

| Component | Role |
|-----------|------|
| **Control plane spec** | YAML that defines experiment name, task source, benchmark filter, configs, model, seeds, and category. |
| **Task source** | Canonical task list (e.g. `configs/selected_benchmark_tasks.json`). |
| **Manifest generator** | Script that reads the spec + task source, sorts tasks deterministically, computes IDs via `lib.matrix.id_generator`, and writes a **run manifest** (JSON). |
| **Runner** | The existing `run_selected_tasks.sh` or a manifest-driven wrapper; executes each run from the manifest so ordering and IDs are fixed. |

## Control plane spec (YAML)

Example: `configs/control_plane_ccb.yaml`

```yaml
# Deterministic control plane for CodeContextBench 2-config runs.
# Same file + same task source → same experiment_id and run list.

experiment_name: ccb_2config
description: "CCB 157 tasks × baseline + sourcegraph_full"
run_category: staging

# Where to get tasks (must have .tasks[].benchmark, .tasks[].task_id, .tasks[].task_dir)
task_source: configs/selected_benchmark_tasks.json

# Optional: limit to one benchmark (e.g. ccb_fix). Omit or empty = all benchmarks.
benchmark_filter: ""

models:
  - anthropic/claude-opus-4-6

mcp_modes:
  - baseline
  - sourcegraph_full

seeds: [0]
```

- **experiment_id** is computed from `experiment_name` + a hash of the spec (and optionally the task source path), so it is deterministic.
- **run_id** / **pair_id** use the existing `lib.matrix.id_generator` (task_id, model, mcp_mode, seed, experiment_id).

## Determinism

Same spec file + same task source file → same `experiment_id`, same `run_id` and `pair_id` for every run, and same ordering. The only field that changes between invocations is `generated_at` in the manifest.

## Generating the manifest

From the repo root:

```bash
# Generate manifest only (no execution)
python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --output runs/staging/manifest.json

# Dry-run: print what would be run
python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --dry-run
```

The manifest JSON looks like:

```json
{
  "experiment_id": "exp_ccb2config_2026-02-18_abc123",
  "experiment_name": "ccb_2config",
  "run_category": "staging",
  "generated_at": "2026-02-18T12:00:00Z",
  "runs": [
    {
      "run_id": "run_baseline_opus_..._seed0_xyz",
      "pair_id": "pair_opus_..._seed0_...",
      "task_id": "...",
      "task_dir": "ccb_design/...",
      "benchmark": "ccb_design",
      "mcp_mode": "baseline",
      "model": "anthropic/claude-opus-4-6",
      "seed": 0
    }
  ],
  "pairs": [ ... ]
}
```

## Using the manifest to drive runs

**Option A – Keep the current runner, add an optional manifest mode**

- Add a flag to `run_selected_tasks.sh`, e.g. `--manifest runs/staging/<experiment_id>/manifest.json`.
- When `--manifest` is set, the script reads `manifest["runs"]`, iterates in order, and for each run invokes `harbor run --path ...` with the task_dir from the manifest. Output directories can include `run_id` so they are stable.

**Option B – Manifest as input to a thin Python runner**

- A small script (e.g. `scripts/run_from_manifest.py`) reads the manifest and, for each run, calls Harbor (or shells out to the same `harbor run` logic), so all execution is manifest-driven.

Either way, the **control plane** is the spec + manifest; the runner is a consumer of the manifest.

## Relation to existing v2 experiment YAMLs

The repo already has a **v2 experiment path** (`lib/config`, `lib/matrix/expander`, `run-eval run -c experiment.yaml`) that uses Harbor's **registry** and dataset/task_names. That path is well suited to benchmarks like swebenchpro that are in the registry.

The **control plane layer** described here is complementary:

- For **CCB in-repo tasks** (benchmarks under `benchmarks/ccb_*`), the control plane spec + manifest generator use the same deterministic ID logic (`id_generator`) but drive the **path-based** runner (`harbor run --path`), which does not require a registry.
- You can later unify by having the manifest generator emit an experiment YAML (or RunSpec list) consumable by the v2 runner if CCB is ever registered in Harbor.

## Checklist for a new deterministic run

1. Ensure `configs/selected_benchmark_tasks.json` (or your task source) is up to date.
2. Create or edit a control plane spec (e.g. `configs/control_plane_ccb.yaml`).
3. Run `python3 scripts/control_plane.py generate --spec ... --output ...` to produce the manifest.
4. Run the benchmark using that manifest (e.g. `run_selected_tasks.sh --manifest runs/staging/<exp_id>/manifest.json` or `scripts/run_from_manifest.py ...`).
5. Post-run: `generate_manifest.py`, `generate_eval_report.py`, etc. can key off `experiment_id` and the run IDs from the control plane manifest for consistent reporting.

## Files

| File | Purpose |
|------|---------|
| `docs/CONTROL_PLANE.md` | This design and usage doc. |
| `configs/control_plane_ccb.yaml` | Example control plane spec for CCB 2-config runs. |
| `scripts/control_plane.py` | Manifest generator: spec + task source → manifest JSON. |
| `lib/matrix/id_generator.py` | Deterministic experiment_id, run_id, pair_id (unchanged). |
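The core determinism claim — same spec in, same `experiment_id` out — hinges on hashing a canonical form of the spec. The real scheme lives in `lib/matrix/id_generator.py`, which is not shown here, so the following is only an illustrative sketch of the hash-of-invariants idea:

```python
# Illustrative sketch of deriving a deterministic experiment_id; the actual
# lib/matrix/id_generator.py may use a different hash or field set.
import hashlib
import json

def experiment_id(experiment_name, spec):
    # Canonicalize (sorted keys, fixed separators) so formatting or key-order
    # differences in the spec do not change the hash.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:8]
    return f"exp_{experiment_name}_{digest}"

spec = {
    "task_source": "configs/selected_benchmark_tasks.json",
    "models": ["anthropic/claude-opus-4-6"],
    "mcp_modes": ["baseline", "sourcegraph_full"],
    "seeds": [0],
}
```

Any change to the model list, modes, seeds, or task source path changes the digest and hence the ID, while re-running with an unchanged spec reproduces it exactly.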

docs/REPO_HEALTH.md

Lines changed: 51 additions & 0 deletions

# Repo Health Gate

Lightweight checks to **commit working solutions often** and **reduce entropy** (doc drift, broken task refs, invalid config). One command before push; the same checks run in CI.

## Goal

- **Catch drift early** — docs referencing missing files, eval_matrix inconsistent with configs, tasks in the selection with no benchmark dir.
- **Keep branches clean** — run the gate before push so main stays green; merge small, working changes.
- **Single contract** — `configs/repo_health.json` defines what "healthy" means; no scattered scripts or tribal knowledge.

## Running the health gate

From the repo root:

```bash
# Full health (docs + config + task preflight static)
python3 scripts/repo_health.py

# Quick health (docs + selection file only; no full task sweep)
python3 scripts/repo_health.py --quick

# Exit code: 0 = all required checks passed, 1 = at least one failed
```

Use **`--quick`** for fast feedback (e.g. pre-commit or after editing only docs/config). Use the **full** gate before merging or before a benchmark run.

## What gets checked

| Check | Quick | Full | Purpose |
|-------|-------|------|---------|
| **docs_consistency** | ✓ | ✓ | Referenced scripts/configs/docs exist; eval_matrix valid. |
| **selection_file** | ✓ | ✓ | `configs/selected_benchmark_tasks.json` exists and is valid JSON. |
| **task_preflight_static** | — | ✓ | All selected tasks: instruction length, test.sh, no placeholders, registry match. |

Contract: `configs/repo_health.json`. Add or relax checks there without changing this doc (then run `docs_consistency_check` so new script refs are valid).

## Branch hygiene (recommendations)

- **Run health before push** — `python3 scripts/repo_health.py` or `--quick`, so you don't push broken refs or task defs.
- **Merge working state often** — small PRs that pass the gate reduce long-lived branches and merge conflicts.
- **After editing docs/config** — run at least `python3 scripts/docs_consistency_check.py` to catch missing refs and matrix drift.

## Fixing common failures

- **missing_ref:README.md:scripts/docs_consistency_check.py** — remove or fix the reference in the doc, or add the missing file.
- **eval_matrix_*** — fix `configs/eval_matrix.json` (supported_configs, official_default_configs, config_definitions).
- **Task preflight errors** — run `python3 scripts/validate_tasks_preflight.py --all` and fix the reported tasks (instruction length, test.sh, placeholders, or sync the task list).

## CI

The health gate runs in CI on push/PR (see `.github/workflows/repo_health.yml`). Fix failures before merging so main stays clean.
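The workflow file itself is not included in this commit dump. A minimal sketch of what `.github/workflows/repo_health.yml` could look like, assuming standard checkout/setup-python actions and a Python 3 environment (every detail below is an assumption, not the actual file):

```yaml
# Hypothetical sketch of .github/workflows/repo_health.yml; the real file
# added by this commit is not shown above.
name: repo-health
on: [push, pull_request]
jobs:
  health:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Run repo health gate
        run: python3 scripts/repo_health.py
```

Because `repo_health.py` exits non-zero on any required-check failure, the job fails the PR check automatically, mirroring the local gate.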

docs/SKILLS.md

Lines changed: 58 additions & 0 deletions

# Skills System

CodeContextBench includes a set of **AI agent skill definitions** in the [`skills/`](../skills/) directory. These are structured markdown runbooks that encode operational knowledge for common benchmark workflows, enabling any AI coding agent to operate the benchmark suite reliably.

## Overview

Skills solve a practical problem: running a benchmark involves many multi-step workflows (infrastructure checks, task validation, run monitoring, failure triage, report generation) that are tedious to re-explain each session. By encoding these as structured files, any agent — Claude Code, Cursor, Copilot, or others — can follow them autonomously.

## Skill Categories

### CCB Operations (`skills/ccb/`)

Project-specific skills for operating the CodeContextBench pipeline:

| File | Skills | Purpose |
|------|--------|---------|
| `pre-run.md` | Check Infrastructure, Validate Tasks, Run Benchmark | Pre-launch readiness and execution |
| `monitoring.md` | Run Status, Watch Benchmarks | Active run monitoring |
| `triage-rerun.md` | Triage Failure, Quick Rerun | Failure investigation and fix verification |
| `analysis.md` | Compare Configs, MCP Audit, IR Analysis, Cost Report, Evaluate Traces | Post-run analysis |
| `maintenance.md` | Repo Health, Sync Metadata, Re-extract Metrics, Archive Run, Generate Report, What's Next | Data hygiene, health gate, reporting |
| `task-authoring.md` | Scaffold Task, Score Tasks, Benchmark Audit | Task creation and quality assurance |

### General Purpose (`skills/general/`)

Reusable skills applicable to any software project:

| File | Skills | Purpose |
|------|--------|---------|
| `workflow-tools.md` | Session Handoff, Strategic Compact, PRD Generator, Ralph Agent, Eval Harness | Session and workflow management |
| `agent-delegation.md` | Delegate, Codex/Cursor/Copilot/Gemini CLI Guides | Multi-agent task routing |
| `deep-search-clickhouse.md` | Deep Search CLI, ClickHouse Patterns | Semantic search and analytics |
| `dev-practices.md` | Security Review, Coding Standards, TDD, Verification Loop, Frontend/Backend Patterns | Development best practices |

## Integration

### Cursor

Skills originated as `.cursor/rules/*.mdc` files. To use them with Cursor, copy them into `.cursor/rules/` and add YAML front matter with `description` and optional `globs` fields. See [`skills/README.md`](../skills/README.md) for details.

### Claude Code

Reference skill files from `CLAUDE.md` or `AGENTS.md`. The agent reads referenced files on demand.

### Other Agents

Skills are plain markdown — any file-reading agent can use them directly.

## Creating New Skills

See the [Adapting for Your Own Project](../skills/README.md#adapting-for-your-own-project) section in the skills README for guidance on writing skills for your own workflows.

## Related Documentation

- [`skills/README.md`](../skills/README.md) — Full skill index and usage guide
- [`CLAUDE.md`](../CLAUDE.md) / [`AGENTS.md`](../AGENTS.md) — Operational quick-reference (references skills)
- [`docs/QA_PROCESS.md`](QA_PROCESS.md) — Quality assurance pipeline (skills automate parts of this)
- [`docs/ERROR_CATALOG.md`](ERROR_CATALOG.md) — Known error patterns (used by the triage skill)
