# Deterministic Control Plane

This document describes how to use the **deterministic control plane** in CodeContextBench: a single declarative spec that defines exactly which runs execute, with stable experiment/run/pair IDs and ordering. Same spec + same task source → same manifest every time.

## Rationale

- **Single source of truth**: "What to run" is defined in one place (control plane spec + task list), not scattered across CLI flags and shell logic.
- **Reproducibility**: Experiment ID, run IDs, and pair IDs are derived from invariants (config hash, task set, model, seeds), so re-runs and comparisons are stable.
- **Separation of concerns**: Control plane = *what* to run (task × config × seed matrix). Execution (Harbor, Docker) = *how* it runs.

## Components

| Component | Role |
|-----------|------|
| **Control plane spec** | YAML that defines experiment name, task source, benchmark filter, configs, model, seeds, category. |
| **Task source** | Canonical task list (e.g. `configs/selected_benchmark_tasks.json`). |
| **Manifest generator** | Script that reads the spec + task source, sorts tasks deterministically, computes IDs via `lib.matrix.id_generator`, and writes a **run manifest** (JSON). |
| **Runner** | Existing `run_selected_tasks.sh` or a manifest-driven wrapper; executes each run from the manifest so ordering and IDs are fixed. |

## Control plane spec (YAML)

Example: `configs/control_plane_ccb.yaml`

```yaml
# Deterministic control plane for CodeContextBench 2-config runs.
# Same file + same task source → same experiment_id and run list.

experiment_name: ccb_2config
description: "CCB 157 tasks × baseline + sourcegraph_full"
run_category: staging

# Where to get tasks (must have .tasks[].benchmark, .tasks[].task_id, .tasks[].task_dir)
task_source: configs/selected_benchmark_tasks.json

# Optional: limit to one benchmark (e.g. ccb_fix). Omit or empty = all benchmarks.
benchmark_filter: ""

models:
  - anthropic/claude-opus-4-6

mcp_modes:
  - baseline
  - sourcegraph_full

seeds: [0]
```

- **experiment_id** is computed from `experiment_name` plus a hash of the spec (and optionally the task source path), so it is deterministic.
- **run_id** / **pair_id** use the existing `lib.matrix.id_generator` (task_id, model, mcp_mode, seed, experiment_id).
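For illustration, the experiment-ID derivation can be sketched as hashing the spec file's raw bytes. The real logic lives in `lib.matrix.id_generator`; the helper below and its ID format are hypothetical:

```python
import hashlib
from pathlib import Path


def derive_experiment_id(spec_path: str, experiment_name: str) -> str:
    """Hash the spec file's raw bytes: any edit to the spec changes the ID,
    while re-running on an unchanged spec always yields the same ID."""
    digest = hashlib.sha256(Path(spec_path).read_bytes()).hexdigest()[:8]
    return f"exp_{experiment_name}_{digest}"
```

Hashing the bytes (rather than a parsed structure) means even comment changes produce a new ID, which errs on the side of treating any spec edit as a new experiment.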
| 50 | + |
| 51 | +## Determinism |
| 52 | + |
| 53 | +Same spec file + same task source file → same `experiment_id`, same `run_id` and `pair_id` for every run, and same ordering. The only field that changes between invocations is `generated_at` in the manifest. |
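Ordering determinism can come from sorting tasks once and expanding the matrix in a fixed nesting order. A sketch, assuming the generator sorts by `(benchmark, task_id)` and walks models, modes, and seeds in spec order (the function name is illustrative):

```python
from itertools import product


def expand_runs(tasks, models, mcp_modes, seeds):
    """Expand the task × model × mcp_mode × seed matrix in a stable order."""
    # Sort input tasks so the output does not depend on task-source ordering.
    ordered = sorted(tasks, key=lambda t: (t["benchmark"], t["task_id"]))
    return [
        {"task_id": t["task_id"], "benchmark": t["benchmark"],
         "task_dir": t["task_dir"], "model": m, "mcp_mode": mode, "seed": s}
        for t in ordered
        for m, mode, s in product(models, mcp_modes, seeds)
    ]
```

Because the sort key and the nesting order are fixed, shuffling the task source file leaves the expanded run list byte-identical.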
| 54 | + |
| 55 | +## Generating the manifest |
| 56 | + |
| 57 | +From the repo root: |
| 58 | + |
| 59 | +```bash |
| 60 | +# Generate manifest only (no execution) |
| 61 | +python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --output runs/staging/manifest.json |
| 62 | +
|
| 63 | +# Dry-run: print what would be run |
| 64 | +python3 scripts/control_plane.py generate --spec configs/control_plane_ccb.yaml --dry-run |
| 65 | +``` |
| 66 | + |
| 67 | +The manifest JSON looks like: |
| 68 | + |
| 69 | +```json |
| 70 | +{ |
| 71 | + "experiment_id": "exp_ccb2config_2026-02-18_abc123", |
| 72 | + "experiment_name": "ccb_2config", |
| 73 | + "run_category": "staging", |
| 74 | + "generated_at": "2026-02-18T12:00:00Z", |
| 75 | + "runs": [ |
| 76 | + { |
| 77 | + "run_id": "run_baseline_opus_..._seed0_xyz", |
| 78 | + "pair_id": "pair_opus_..._seed0_...", |
| 79 | + "task_id": "...", |
| 80 | + "task_dir": "ccb_design/...", |
| 81 | + "benchmark": "ccb_design", |
| 82 | + "mcp_mode": "baseline", |
| 83 | + "model": "anthropic/claude-opus-4-6", |
| 84 | + "seed": 0 |
| 85 | + } |
| 86 | + ], |
| 87 | + "pairs": [ ... ] |
| 88 | +} |
| 89 | +``` |
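A manifest consumer can fail fast by checking the per-run fields it depends on before executing anything. A minimal sketch (the required-field set mirrors the example above; the function name is illustrative):

```python
import json

# Per-run fields every consumer of the manifest relies on.
REQUIRED_RUN_FIELDS = {
    "run_id", "pair_id", "task_id", "task_dir",
    "benchmark", "mcp_mode", "model", "seed",
}


def load_manifest(path):
    """Load a manifest and raise if any run entry is incomplete."""
    with open(path) as f:
        manifest = json.load(f)
    for run in manifest["runs"]:
        missing = REQUIRED_RUN_FIELDS - run.keys()
        if missing:
            raise ValueError(f"run {run.get('run_id')!r} missing {sorted(missing)}")
    return manifest
```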
| 90 | + |
| 91 | +## Using the manifest to drive runs |
| 92 | + |
| 93 | +**Option A – Keep current runner, add optional manifest mode** |
| 94 | + |
| 95 | +- Add a flag to `run_selected_tasks.sh`: e.g. `--manifest runs/staging/<experiment_id>/manifest.json`. |
| 96 | +- When `--manifest` is set, the script reads `manifest["runs"]`, iterates in order, and for each run invokes `harbor run --path ...` with the task_dir from the manifest. Output directories can include `run_id` so they are stable. |
| 97 | + |
| 98 | +**Option B – Manifest as input to a thin Python runner** |
| 99 | + |
| 100 | +- A small script (e.g. `scripts/run_from_manifest.py`) that reads the manifest and for each run calls Harbor (or shells out to the same `harbor run` logic), so all execution is manifest-driven. |
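A minimal sketch of what such a runner could look like, assuming the `harbor run --path` invocation from Option A; how a per-run output directory is passed to Harbor depends on its CLI and is left as a comment:

```python
import json
import subprocess


def run_from_manifest(manifest_path: str) -> None:
    """Execute every run in manifest order, so execution matches the manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    for run in manifest["runs"]:
        # A real runner would also wire a per-run output directory keyed by
        # run["run_id"] into this invocation (the flag depends on Harbor's CLI).
        cmd = ["harbor", "run", "--path", run["task_dir"]]
        print(f"[{run['run_id']}] {' '.join(cmd)}")
        subprocess.run(cmd, check=True)
```

`check=True` stops the sweep on the first failed run; a production runner might instead record the failure and continue, since `run_id`s make partial re-runs cheap to reconcile.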
| 101 | + |
| 102 | +Either way, the **control plane** is the spec + manifest; the runner is a consumer of the manifest. |
| 103 | + |
| 104 | +## Relation to existing v2 experiment YAMLs |
| 105 | + |
| 106 | +The repo already has a **v2 experiment path** (`lib/config`, `lib/matrix/expander`, `run-eval run -c experiment.yaml`) that uses Harbor’s **registry** and dataset/task_names. That path is well-suited to benchmarks like swebenchpro that are in the registry. |
| 107 | + |
| 108 | +The **control plane layer** described here is complementary: |
| 109 | + |
| 110 | +- For **CCB in-repo tasks** (benchmarks under `benchmarks/ccb_*`), the control plane spec + manifest generator use the same deterministic ID logic (`id_generator`) but drive the **path-based** runner (`harbor run --path`), which does not require a registry. |
| 111 | +- You can later unify by having the manifest generator emit an experiment YAML (or RunSpec list) consumable by the v2 runner if CCB is ever registered in Harbor. |
| 112 | + |
| 113 | +## Checklist for a new deterministic run |
| 114 | + |
| 115 | +1. Ensure `configs/selected_benchmark_tasks.json` (or your task source) is up to date. |
| 116 | +2. Create or edit a control plane spec (e.g. `configs/control_plane_ccb.yaml`). |
| 117 | +3. Run `python3 scripts/control_plane.py generate --spec ... --output ...` to produce the manifest. |
| 118 | +4. Run the benchmark using that manifest (e.g. `run_selected_tasks.sh --manifest runs/staging/<exp_id>/manifest.json` or `scripts/run_from_manifest.py ...`). |
| 119 | +5. Post-run: `generate_manifest.py`, `generate_eval_report.py`, etc. can key off `experiment_id` and run IDs from the control plane manifest for consistent reporting. |
| 120 | + |
| 121 | +## Files |
| 122 | + |
| 123 | +| File | Purpose | |
| 124 | +|------|---------| |
| 125 | +| `docs/CONTROL_PLANE.md` | This design and usage. | |
| 126 | +| `configs/control_plane_ccb.yaml` | Example control plane spec for CCB 2-config. | |
| 127 | +| `scripts/control_plane.py` | Manifest generator: spec + task source → manifest JSON. | |
| 128 | +| `lib/matrix/id_generator.py` | Deterministic experiment_id, run_id, pair_id (unchanged). | |