
Commit b607290

sjarmak and cursoragent committed
Add deterministic control plane, repo health gate, and agent-facing instructions
- Control plane: configs/control_plane_ccb.yaml, scripts/control_plane.py, docs/CONTROL_PLANE.md for a deterministic run manifest from spec + task source
- Repo health: configs/repo_health.json, scripts/repo_health.py, docs/REPO_HEALTH.md; single gate (docs consistency, selection file, task preflight) to reduce drift and keep branches clean
- CLAUDE.md/AGENTS.md: step 0 and skill routing for "before commit/push" run repo_health; Repo health section and canonical ref to REPO_HEALTH.md
- Skill repo-health (skills/repo-health/SKILL.md): trigger on commit/push, reduce drift; maintenance.md and SKILLS.md updated
- docs_consistency_check: add REPO_HEALTH.md to default docs
- .github/workflows/repo_health.yml added (if .gitignore allows)

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 56159d5 commit b607290

File tree

13 files changed: +856 −3 lines changed


AGENTS.md

Lines changed: 8 additions & 1 deletion

@@ -20,6 +20,7 @@ per-task details.
 - `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions
 - `docs/AGENT_INTERFACE.md` - runtime I/O contract
 - `docs/EXTENSIBILITY.md` - safe suite/task/config extension
+- `docs/REPO_HEALTH.md` - health gate and branch hygiene (reduce drift)
 - `docs/LEADERBOARD.md` - ranking policy
 - `docs/SUBMISSION.md` - submission format
 - `docs/SKILLS.md` - AI agent skill system overview
@@ -28,6 +29,7 @@ per-task details.
 ## Typical Skill Routing
 Use these defaults unless there is a task-specific reason not to.

+- **Before commit or push:** `repo-health` — run `python3 scripts/repo_health.py` (or `--quick` if only docs/config changed). Do not commit/push with failing health checks.
 - Pre-run readiness: `check-infra`, `validate-tasks`
 - Launch/runs: `run-benchmark`, `run-status`, `watch-benchmarks`
 - Failure investigation: `triage-failure`, `quick-rerun`
@@ -45,6 +47,7 @@ See `docs/CONFIGS.md` for the full environment model, tool lists, and how to
 add sg_only support to new tasks.

 ## Standard Workflow
+0. **Before commit or push:** Run `python3 scripts/repo_health.py` (or `--quick`). Fix any failures so main stays clean and drift is caught early (see `docs/REPO_HEALTH.md`).
 1. Run infrastructure checks before any batch.
 2. Validate task integrity before launch (include runtime smoke for new/changed tasks).
 3. Run the benchmark config (`configs/*_2config.sh` or equivalent).
@@ -92,6 +95,7 @@ python3 scripts/generate_eval_report.py
 python3 scripts/abc_audit.py --suite <suite> # quality audit
 python3 scripts/abc_score_task.py --suite <suite> # per-task quality score
 python3 scripts/docs_consistency_check.py # documentation drift guard
+python3 scripts/repo_health.py # repo health gate (before push); --quick for fast check
 ```

 ## Script Entrypoints
@@ -126,11 +130,14 @@ python3 scripts/docs_consistency_check.py # documentation drift guard
 - `audit_traces.py` - agent trace auditing
 - `ds_audit.py` - Deep Search usage audit

+### Repo health (reduce drift, clean branches)
+- `repo_health.py` - single gate: docs consistency + selection file + task preflight (see docs/REPO_HEALTH.md)
+- `docs_consistency_check.py` - documentation drift guard
+
 ### Quality Assurance
 - `abc_audit.py` - ABC benchmark quality audit (32 criteria across 3 dimensions)
 - `abc_score_task.py` - per-task quality scoring
 - `abc_criteria.py` - ABC criteria data model
-- `docs_consistency_check.py` - documentation drift guard
 - `validate_official_integrity.py` - official run integrity checks
 - `quarantine_invalid_tasks.py` - quarantine tasks with zero MCP usage

CLAUDE.md

Lines changed: 8 additions & 1 deletion (the diff is identical to the AGENTS.md change above)

configs/control_plane_ccb.yaml

Lines changed: 26 additions & 0 deletions

# Deterministic control plane for CodeContextBench 2-config runs.
# Same file + same task source → same experiment_id and run list.
#
# Generate manifest:
#   python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --output runs/staging/<experiment_id>/manifest.json
# Dry-run:
#   python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --dry-run

experiment_name: ccb_2config
description: "CCB 157 tasks × baseline + sourcegraph_full"
run_category: staging

# Path to task list (relative to repo root). Must have .tasks[].benchmark, .tasks[].task_id, .tasks[].task_dir
task_source: configs/selected_benchmark_tasks.json

# Optional: limit to one benchmark (e.g. ccb_fix). Omit or empty string = all benchmarks.
benchmark_filter: ""

models:
  - anthropic/claude-opus-4-6

mcp_modes:
  - baseline
  - sourcegraph_full

seeds: [0]
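The spec above expands into a run matrix of task × model × mcp_mode × seed. As an illustration only (the real generator is `scripts/control_plane.py`, which this commit does not show, and the task names below are hypothetical stand-ins), the expansion could be sketched as:

```python
# Hypothetical sketch of expanding a control plane spec into a deterministic
# run list; not the actual scripts/control_plane.py implementation.
from itertools import product

spec = {
    "models": ["anthropic/claude-opus-4-6"],
    "mcp_modes": ["baseline", "sourcegraph_full"],
    "seeds": [0],
}
# Stand-in task IDs; the real list comes from configs/selected_benchmark_tasks.json.
tasks = ["ccb_fix/task_001", "ccb_design/task_002"]

def expand(spec, tasks):
    # Sort tasks first so the run order is stable across invocations.
    return [
        {"task": t, "model": m, "mcp_mode": mode, "seed": s}
        for t, m, mode, s in product(
            sorted(tasks), spec["models"], spec["mcp_modes"], spec["seeds"]
        )
    ]

runs = expand(spec, tasks)  # 2 tasks × 1 model × 2 modes × 1 seed = 4 runs
```

Because the task list is sorted and the product order is fixed, re-running the expansion with the same inputs yields the same list in the same order, which is the determinism property the spec comments promise.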

configs/repo_health.json

Lines changed: 33 additions & 0 deletions

{
  "version": "1",
  "description": "Contract for repo health: run these checks to keep the tree clean and reduce drift before commit/push.",
  "checks": {
    "docs_consistency": {
      "script": "scripts/docs_consistency_check.py",
      "required": true,
      "description": "Doc and config references exist; eval_matrix valid"
    },
    "task_preflight_static": {
      "script": "scripts/validate_tasks_preflight.py",
      "args": ["--all"],
      "required": true,
      "description": "Task definitions valid (instruction length, test.sh, no placeholders)"
    },
    "selection_file": {
      "script": null,
      "required": true,
      "description": "configs/selected_benchmark_tasks.json exists and is valid JSON"
    }
  },
  "quick_checks": [
    "docs_consistency",
    "selection_file"
  ],
  "branch_hygiene": {
    "recommendations": [
      "Run repo_health (or repo_health --quick) before push",
      "Merge working state often; keep branches short",
      "After changing docs or configs, run docs_consistency to catch drift"
    ]
  }
}
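A gate runner consuming this contract is straightforward: in `--quick` mode it runs only the names in `quick_checks`, otherwise every entry in `checks`, shelling out to each check's `script`. The sketch below is hypothetical (the real `scripts/repo_health.py` is not shown in this commit) and inlines a trimmed copy of the contract for self-containment:

```python
# Hypothetical sketch of a gate runner for the repo_health contract;
# the actual scripts/repo_health.py may differ.
import json
import subprocess
import sys

CONTRACT = json.loads("""
{
  "checks": {
    "docs_consistency": {"script": "scripts/docs_consistency_check.py", "required": true},
    "task_preflight_static": {"script": "scripts/validate_tasks_preflight.py", "args": ["--all"], "required": true},
    "selection_file": {"script": null, "required": true}
  },
  "quick_checks": ["docs_consistency", "selection_file"]
}
""")

def pick_checks(contract, quick):
    """Return the ordered list of check names for this mode."""
    return list(contract["quick_checks"]) if quick else list(contract["checks"])

def run_gate(contract, quick=False):
    """Run each selected check; exit code 0 = all required checks passed."""
    failed = []
    for name in pick_checks(contract, quick):
        spec = contract["checks"][name]
        if spec["script"] is None:
            continue  # e.g. selection_file is validated in-process, not via a script
        cmd = [sys.executable, spec["script"], *spec.get("args", [])]
        if subprocess.run(cmd).returncode != 0 and spec["required"]:
            failed.append(name)
    return 0 if not failed else 1
```

The `quick_checks` list is what makes `--quick` skip the full task sweep while keeping the same contract file as the single source of truth.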

docs/CONTROL_PLANE.md

Lines changed: 128 additions & 0 deletions

# Deterministic Control Plane

This document describes how to use the **deterministic control plane** in CodeContextBench: a single declarative spec that defines exactly which runs execute, with stable experiment/run/pair IDs and ordering. Same spec + same task source → same manifest every time.

## Rationale

- **Single source of truth**: "What to run" is defined in one place (control plane spec + task list), not scattered across CLI flags and shell logic.
- **Reproducibility**: Experiment ID, run IDs, and pair IDs are derived from invariants (config hash, task set, model, seeds), so re-runs and comparisons are stable.
- **Separation of concerns**: The control plane defines *what* to run (the task × config × seed matrix); execution (Harbor, Docker) defines *how* it runs.

## Components

| Component | Role |
|-----------|------|
| **Control plane spec** | YAML that defines experiment name, task source, benchmark filter, configs, model, seeds, and category. |
| **Task source** | Canonical task list (e.g. `configs/selected_benchmark_tasks.json`). |
| **Manifest generator** | Script that reads the spec + task source, sorts tasks deterministically, computes IDs via `lib.matrix.id_generator`, and writes a **run manifest** (JSON). |
| **Runner** | The existing `run_selected_tasks.sh` or a manifest-driven wrapper; executes each run from the manifest so ordering and IDs are fixed. |

## Control plane spec (YAML)

Example: `configs/control_plane_ccb.yaml`

```yaml
# Deterministic control plane for CodeContextBench 2-config runs.
# Same file + same task source → same experiment_id and run list.

experiment_name: ccb_2config
description: "CCB 157 tasks × baseline + sourcegraph_full"
run_category: staging

# Where to get tasks (must have .tasks[].benchmark, .tasks[].task_id, .tasks[].task_dir)
task_source: configs/selected_benchmark_tasks.json

# Optional: limit to one benchmark (e.g. ccb_fix). Omit or empty = all benchmarks.
benchmark_filter: ""

models:
  - anthropic/claude-opus-4-6

mcp_modes:
  - baseline
  - sourcegraph_full

seeds: [0]
```

- **experiment_id** is computed from `experiment_name` + a hash of the spec (and optionally the task source path), so it is deterministic.
- **run_id** / **pair_id** use the existing `lib.matrix.id_generator` (task_id, model, mcp_mode, seed, experiment_id).

## Determinism

Same spec file + same task source file → same `experiment_id`, same `run_id` and `pair_id` for every run, and same ordering. The only field that changes between invocations is `generated_at` in the manifest.

## Generating the manifest

From the repo root:

```bash
# Generate manifest only (no execution)
python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --output runs/staging/manifest.json

# Dry-run: print what would be run
python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --dry-run
```

The manifest JSON looks like:

```json
{
  "experiment_id": "exp_ccb2config_2026-02-18_abc123",
  "experiment_name": "ccb_2config",
  "run_category": "staging",
  "generated_at": "2026-02-18T12:00:00Z",
  "runs": [
    {
      "run_id": "run_baseline_opus_..._seed0_xyz",
      "pair_id": "pair_opus_..._seed0_...",
      "task_id": "...",
      "task_dir": "ccb_design/...",
      "benchmark": "ccb_design",
      "mcp_mode": "baseline",
      "model": "anthropic/claude-opus-4-6",
      "seed": 0
    }
  ],
  "pairs": [ ... ]
}
```

## Using the manifest to drive runs

**Option A – Keep the current runner, add an optional manifest mode**

- Add a flag to `run_selected_tasks.sh`, e.g. `--manifest runs/staging/<experiment_id>/manifest.json`.
- When `--manifest` is set, the script reads `manifest["runs"]`, iterates in order, and for each run invokes `harbor run --path ...` with the task_dir from the manifest. Output directories can include `run_id` so they are stable.

**Option B – Manifest as input to a thin Python runner**

- A small script (e.g. `scripts/run_from_manifest.py`) reads the manifest and, for each run, calls Harbor (or shells out to the same `harbor run` logic), so all execution is manifest-driven.

Either way, the **control plane** is the spec + manifest; the runner is a consumer of the manifest.

## Relation to existing v2 experiment YAMLs

The repo already has a **v2 experiment path** (`lib/config`, `lib/matrix/expander`, `run-eval run -c experiment.yaml`) that uses Harbor's **registry** and dataset/task_names. That path is well suited to benchmarks like swebenchpro that are in the registry.

The **control plane layer** described here is complementary:

- For **CCB in-repo tasks** (benchmarks under `benchmarks/ccb_*`), the control plane spec + manifest generator use the same deterministic ID logic (`id_generator`) but drive the **path-based** runner (`harbor run --path`), which does not require a registry.
- You can later unify by having the manifest generator emit an experiment YAML (or RunSpec list) consumable by the v2 runner if CCB is ever registered in Harbor.

## Checklist for a new deterministic run

1. Ensure `configs/selected_benchmark_tasks.json` (or your task source) is up to date.
2. Create or edit a control plane spec (e.g. `configs/control_plane_ccb.yaml`).
3. Run `python3 scripts/control_plane.py generate --spec ... --output ...` to produce the manifest.
4. Run the benchmark using that manifest (e.g. `run_selected_tasks.sh --manifest runs/staging/<exp_id>/manifest.json` or `scripts/run_from_manifest.py ...`).
5. Post-run: `generate_manifest.py`, `generate_eval_report.py`, etc. can key off `experiment_id` and the run IDs from the control plane manifest for consistent reporting.

## Files

| File | Purpose |
|------|---------|
| `docs/CONTROL_PLANE.md` | This design and usage doc. |
| `configs/control_plane_ccb.yaml` | Example control plane spec for CCB 2-config runs. |
| `scripts/control_plane.py` | Manifest generator: spec + task source → manifest JSON. |
| `lib/matrix/id_generator.py` | Deterministic experiment_id, run_id, pair_id (unchanged). |
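The core determinism claim — same spec in, same `experiment_id` out — hinges on hashing a canonical form of the spec. The real scheme lives in `lib/matrix/id_generator.py`, which is not shown here, so the following is only an illustrative sketch of the hash-of-invariants idea:

```python
# Illustrative sketch of deriving a deterministic experiment_id; the actual
# lib/matrix/id_generator.py may use a different hash or field set.
import hashlib
import json

def experiment_id(experiment_name, spec):
    # Canonicalize (sorted keys, fixed separators) so formatting or key-order
    # differences in the spec do not change the hash.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:8]
    return f"exp_{experiment_name}_{digest}"

spec = {
    "task_source": "configs/selected_benchmark_tasks.json",
    "models": ["anthropic/claude-opus-4-6"],
    "mcp_modes": ["baseline", "sourcegraph_full"],
    "seeds": [0],
}
```

Any change to the model list, modes, seeds, or task source path changes the digest and hence the ID, while re-running with an unchanged spec reproduces it exactly.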

docs/REPO_HEALTH.md

Lines changed: 51 additions & 0 deletions

# Repo Health Gate

Lightweight checks to **commit working solutions often** and **reduce entropy** (doc drift, broken task refs, invalid config). One command before push; the same checks run in CI.

## Goal

- **Catch drift early** — docs referencing missing files, eval_matrix inconsistent with configs, tasks in the selection with no benchmark dir.
- **Keep branches clean** — run the gate before push so main stays green; merge small, working changes.
- **Single contract** — `configs/repo_health.json` defines what "healthy" means; no scattered scripts or tribal knowledge.

## Running the health gate

From the repo root:

```bash
# Full health (docs + config + task preflight static)
python3 scripts/repo_health.py

# Quick health (docs + selection file only; no full task sweep)
python3 scripts/repo_health.py --quick

# Exit code: 0 = all required checks passed, 1 = at least one failed
```

Use **`--quick`** for fast feedback (e.g. pre-commit or after editing only docs/config). Use the **full** gate before merging or before a benchmark run.

## What gets checked

| Check | Quick | Full | Purpose |
|-------|-------|------|---------|
| **docs_consistency** | ✓ | ✓ | Referenced scripts/configs/docs exist; eval_matrix valid. |
| **selection_file** | ✓ | ✓ | `configs/selected_benchmark_tasks.json` exists and is valid JSON. |
| **task_preflight_static** | — | ✓ | All selected tasks: instruction length, test.sh, no placeholders, registry match. |

Contract: `configs/repo_health.json`. Add or relax checks there without changing this doc (then run `docs_consistency_check` so new script refs are valid).

## Branch hygiene (recommendations)

- **Run health before push** — `python3 scripts/repo_health.py` or `--quick`, so you don't push broken refs or task defs.
- **Merge working state often** — small PRs that pass the gate reduce long-lived branches and merge conflicts.
- **After editing docs/config** — run at least `python3 scripts/docs_consistency_check.py` to catch missing refs and matrix drift.

## Fixing common failures

- **missing_ref:README.md:scripts/docs_consistency_check.py** — remove or fix the reference in the doc, or add the missing file.
- **eval_matrix_*** — fix `configs/eval_matrix.json` (supported_configs, official_default_configs, config_definitions).
- **Task preflight errors** — run `python3 scripts/validate_tasks_preflight.py --all` and fix the reported tasks (instruction length, test.sh, placeholders, or sync the task list).

## CI

The health gate runs in CI on push/PR (see `.github/workflows/repo_health.yml`). Fix failures before merging so main stays clean.
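The workflow file itself is not included in this commit dump. A minimal sketch of what `.github/workflows/repo_health.yml` could look like, assuming standard checkout/setup-python actions and a Python 3 environment (every detail below is an assumption, not the actual file):

```yaml
# Hypothetical sketch of .github/workflows/repo_health.yml; the real file
# added by this commit is not shown above.
name: repo-health
on: [push, pull_request]
jobs:
  health:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Run repo health gate
        run: python3 scripts/repo_health.py
```

Because `repo_health.py` exits non-zero on any required-check failure, the job fails the PR check automatically, mirroring the local gate.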

docs/SKILLS.md

Lines changed: 58 additions & 0 deletions

# Skills System

CodeContextBench includes a set of **AI agent skill definitions** in the [`skills/`](../skills/) directory. These are structured markdown runbooks that encode operational knowledge for common benchmark workflows, enabling any AI coding agent to operate the benchmark suite reliably.

## Overview

Skills solve a practical problem: running a benchmark involves many multi-step workflows (infrastructure checks, task validation, run monitoring, failure triage, report generation) that are tedious to re-explain each session. By encoding these as structured files, any agent — Claude Code, Cursor, Copilot, or others — can follow them autonomously.

## Skill Categories

### CCB Operations (`skills/ccb/`)

Project-specific skills for operating the CodeContextBench pipeline:

| File | Skills | Purpose |
|------|--------|---------|
| `pre-run.md` | Check Infrastructure, Validate Tasks, Run Benchmark | Pre-launch readiness and execution |
| `monitoring.md` | Run Status, Watch Benchmarks | Active run monitoring |
| `triage-rerun.md` | Triage Failure, Quick Rerun | Failure investigation and fix verification |
| `analysis.md` | Compare Configs, MCP Audit, IR Analysis, Cost Report, Evaluate Traces | Post-run analysis |
| `maintenance.md` | Repo Health, Sync Metadata, Re-extract Metrics, Archive Run, Generate Report, What's Next | Data hygiene, health gate, reporting |
| `task-authoring.md` | Scaffold Task, Score Tasks, Benchmark Audit | Task creation and quality assurance |

### General Purpose (`skills/general/`)

Reusable skills applicable to any software project:

| File | Skills | Purpose |
|------|--------|---------|
| `workflow-tools.md` | Session Handoff, Strategic Compact, PRD Generator, Ralph Agent, Eval Harness | Session and workflow management |
| `agent-delegation.md` | Delegate, Codex/Cursor/Copilot/Gemini CLI Guides | Multi-agent task routing |
| `deep-search-clickhouse.md` | Deep Search CLI, ClickHouse Patterns | Semantic search and analytics |
| `dev-practices.md` | Security Review, Coding Standards, TDD, Verification Loop, Frontend/Backend Patterns | Development best practices |

## Integration

### Cursor

Skills originated as `.cursor/rules/*.mdc` files. To use them with Cursor, copy them into `.cursor/rules/` and add YAML front matter with `description` and optional `globs` fields. See [`skills/README.md`](../skills/README.md) for details.

### Claude Code

Reference skill files from `CLAUDE.md` or `AGENTS.md`. The agent reads referenced files on demand.

### Other Agents

Skills are plain markdown — any file-reading agent can use them directly.

## Creating New Skills

See the [Adapting for Your Own Project](../skills/README.md#adapting-for-your-own-project) section in the skills README for guidance on writing skills for your own workflows.

## Related Documentation

- [`skills/README.md`](../skills/README.md) — Full skill index and usage guide
- [`CLAUDE.md`](../CLAUDE.md) / [`AGENTS.md`](../AGENTS.md) — Operational quick-reference (references skills)
- [`docs/QA_PROCESS.md`](QA_PROCESS.md) — Quality assurance pipeline (skills automate parts of this)
- [`docs/ERROR_CATALOG.md`](ERROR_CATALOG.md) — Known error patterns (used by the triage skill)
