# Handoff: Re-Curation IR Analysis

## Context
We re-curated ground truth for 311/367 benchmark tasks using a calibrated curator agent (Opus 4.6, phase1 prompt, hybrid backend). The new ground truth files are `_agent` variants that exist alongside the original manually-authored files.

**Commit**: `dd4d62eec3` — "Add calibrated curator ground truth (311/367) and harden Daytona sandbox lifecycle"

## What Was Done
- **Org: 207/207 complete** — all tasks have `oracle_answer_agent.json` in `benchmarks/csb_org_*/*/tests/`
- **SDLC: 104/160 complete** — tasks have `ground_truth_agent.json` in `benchmarks/csb_sdlc_*/*/tests/`
- **56 SDLC tasks still missing** — blocked by OAuth rate limits (Accounts 2 and 3 are limited until Mar 6, 3am UTC; Account 1 is available)
- The missing SDLC tasks are concentrated in: `test` (16), `understand` (11), `debug` (10, of which 4 are the known linux `--branch` parse bugs), `secure` (6), `document` (4), `feature` (4), `refactor` (2), `fix` (2), `design` (1)

## Files Modified
- `scripts/daytona_curator_runner.py` — hardened with orphan cleanup, auto-stop, signal handler, parallel=55 default
- `benchmarks/csb_org_*/*/tests/oracle_answer_agent.json` — 207 new curator-generated Org oracle files
- `benchmarks/csb_org_*/*/tests/ground_truth.json` — 207 updated (curator also writes canonical for Org)
- `benchmarks/csb_org_*/*/tests/ground_truth_meta.json` — 207 metadata files
- `benchmarks/csb_sdlc_*/*/tests/ground_truth_agent.json` — 104 new curator-generated SDLC ground truth files
- `benchmarks/csb_sdlc_*/*/tests/ground_truth_meta.json` — 104 metadata files

## Task 1: Complete Remaining 56 SDLC Tasks

Account 1 is available. Run:
```bash
source .env.local && export HARBOR_ENV=daytona DAYTONA_OVERRIDE_STORAGE=10240 CCB_ACCOUNT=1
python3 scripts/daytona_curator_runner.py \
  --sdlc-all --skip-agent-variants \
  --model claude-opus-4-6 --backend hybrid --prompt-version phase1 \
  --parallel 55
```

After completion, 4 linux kernel tasks will still fail (`linux-acpi-backlight-fault-001`, `linux-hda-intel-suspend-fault-001`, `linux-iwlwifi-subdevice-fault-001`, `linux-nfs-inode-revalidate-fault-001`) — their Dockerfiles use `git clone --branch`, which the runner mis-parses as a repo slug. These four need manual ground truth.
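To make the failure mode concrete, here is a hypothetical sketch of the kind of naive clone-command parsing that produces it (illustrative only; the runner's actual code may differ). The first token after `clone` is treated as the repo slug, so `--branch` wins over the URL:

```python
# Hypothetical illustration of the `--branch` parse bug; not the runner's real parser.
def repo_slug_from_clone(cmd: str) -> str:
    tokens = cmd.split()
    # Naive parse: assume the token right after "clone" is the repo URL.
    # On "git clone --branch v6.1 https://github.com/torvalds/linux.git"
    # this returns "--branch" instead of the URL.
    return tokens[tokens.index("clone") + 1]

def repo_slug_from_clone_fixed(cmd: str) -> str:
    # Skip option flags (and the value of flags that take one) before
    # picking the first positional argument as the repo URL.
    value_flags = {"--branch", "-b", "--depth", "--origin", "-o"}
    args = iter(cmd.split()[2:])  # drop "git clone"
    for tok in args:
        if tok in value_flags:
            next(args, None)  # consume the flag's value
        elif not tok.startswith("-"):
            return tok
    raise ValueError(f"no repo URL in: {cmd!r}")
```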
## Task 2: Promote Agent Oracles

After all tasks complete, promote `_agent` variants to canonical:
```bash
python3 scripts/promote_agent_oracles.py --force
```

This replaces `ground_truth.json` / `oracle_answer.json` with the calibrated `_agent` versions.
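The script itself is authoritative; as a rough sketch, the promotion step amounts to copying each `_agent` variant over its canonical counterpart (file names taken from the layout above; everything else in this snippet is assumed):

```python
# Rough sketch of the promotion step; scripts/promote_agent_oracles.py is authoritative.
import shutil
from pathlib import Path

AGENT_TO_CANONICAL = {
    "ground_truth_agent.json": "ground_truth.json",    # SDLC
    "oracle_answer_agent.json": "oracle_answer.json",  # Org
}

for agent_name, canonical_name in AGENT_TO_CANONICAL.items():
    for agent_file in Path("benchmarks").glob(f"csb_*/*/tests/{agent_name}"):
        canonical = agent_file.with_name(canonical_name)
        shutil.copyfile(agent_file, canonical)  # --force semantics: overwrite unconditionally
        print(f"promoted {agent_file} -> {canonical}")
```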
## Task 3: Re-Run IR Analysis

The IR evaluation pipeline reads (see the fallback sketch after this list):
- SDLC: `ground_truth.json` (so promotion must happen first)
- Org: `oracle_answer.json` first, then `ground_truth.json` as a fallback
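A minimal sketch of that read order, assuming a pathlib-style loader (the pipeline's actual code may differ):

```python
import json
from pathlib import Path

def load_ground_truth(task_dir: Path, suite: str) -> dict:
    """Illustrative loader matching the read order above."""
    tests = task_dir / "tests"
    if suite == "org":
        # Org: prefer oracle_answer.json, fall back to ground_truth.json.
        for name in ("oracle_answer.json", "ground_truth.json"):
            candidate = tests / name
            if candidate.exists():
                return json.loads(candidate.read_text())
        raise FileNotFoundError(f"no oracle for {task_dir}")
    # SDLC: only the canonical file is read, which is why promotion must run first.
    return json.loads((tests / "ground_truth.json").read_text())
```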
After promotion, regenerate the IR analysis:
```bash
# Normalize retrieval events from all official runs
python3 scripts/normalize_retrieval_events.py --runs-dir runs/official/

# Evaluate IR metrics against new ground truth
python3 scripts/compute_retrieval_metrics.py --runs-dir runs/official/ --output results/ir/

# Generate the V2 report with updated IR numbers
python3 scripts/extract_v2_report_data.py
```
Key metrics to compare before/after promotion (a set-based F1 sketch follows this list):
- Per-suite F1, precision, recall
- Baseline vs SG_full delta (does the MCP advantage change with better ground truth?)
- Overall aggregate F1
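For the per-task numbers, the set-based computation is standard; a self-contained sketch (assuming the metrics are computed over sets of file paths, per `expected_files` above):

```python
def prf1(expected: set[str], retrieved: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over file paths (illustrative)."""
    if not expected or not retrieved:
        return 0.0, 0.0, 0.0
    tp = len(expected & retrieved)
    precision = tp / len(retrieved)
    recall = tp / len(expected)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Example: 2 of 3 expected files retrieved, plus 1 spurious file -> all metrics 2/3.
print(prf1({"a.py", "b.py", "c.py"}, {"a.py", "b.py", "x.py"}))
```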
## Task 4: Quality Spot-Check (Before Promotion)

Before promoting, spot-check a sample of `_agent` vs canonical ground truth:
```bash
# Pick 5 random tasks and compare file lists
for f in $(find benchmarks/csb_sdlc_* -name ground_truth_agent.json | shuf | head -5); do
  canonical=$(dirname "$f")/ground_truth.json
  echo "=== $(basename $(dirname $(dirname $f))) ==="
  echo "Canonical files: $(python3 -c "import json; print(len(json.load(open('$canonical')).get('expected_files', [])))" 2>/dev/null || echo "N/A")"
  echo "Agent files: $(python3 -c "import json; print(len(json.load(open('$f')).get('expected_files', [])))")"
  echo ""
done
```
Look for the following (an automated check is sketched after this list):
- Agent producing 0 or 1 files (regex rescue, low quality) — should be re-run
- Agent producing 50+ files (over-inclusion) — may need review
- The canonical file having entries the agent missed (recall regression)
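These checks can be automated across the whole suite; a hedged sketch using the thresholds above (again assuming `expected_files` holds the file list):

```python
import json
from pathlib import Path

for agent_file in Path("benchmarks").glob("csb_sdlc_*/*/tests/ground_truth_agent.json"):
    task = agent_file.parent.parent.name
    agent = set(json.loads(agent_file.read_text()).get("expected_files", []))
    canonical_file = agent_file.with_name("ground_truth.json")
    canonical = set()
    if canonical_file.exists():
        canonical = set(json.loads(canonical_file.read_text()).get("expected_files", []))
    if len(agent) <= 1:
        print(f"{task}: agent produced {len(agent)} file(s) (likely regex rescue; re-run)")
    elif len(agent) >= 50:
        print(f"{task}: agent produced {len(agent)} files (possible over-inclusion; review)")
    missed = canonical - agent
    if missed:
        print(f"{task}: agent missed {len(missed)} canonical file(s) (recall regression)")
```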
## Key Architecture Notes

- `write_curator_outputs()` in `context_retrieval_agent.py` handles file writing for both SDLC and Org
- When `overwrite=False` (the default), it writes `_agent` variants; when `overwrite=True`, it writes the canonical files (see the sketch below)
- `ground_truth_meta.json` contains curator metadata: model, backend, prompt version, cost, timestamp
- The curator uses the phase1 prompt (`PHASE1_CLI_PROMPTS` + `PHASE1_SUFFIX`), which is recall-focused (F1=0.749 on the calibration set)
- Hybrid backend = local tools (Bash, Read, Glob, Grep) + Sourcegraph MCP (sg_keyword_search, sg_nls_search)
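A hedged sketch of that naming behavior (the real logic lives in `write_curator_outputs()`; this helper and its signature are hypothetical):

```python
def curator_output_names(suite: str, overwrite: bool = False) -> list[str]:
    """Illustrative only: which files the curator writes per suite and mode."""
    if suite == "org":
        # Org: oracle variant, plus canonical ground truth and metadata
        # (matching the Files Modified list above).
        oracle = "oracle_answer.json" if overwrite else "oracle_answer_agent.json"
        return [oracle, "ground_truth.json", "ground_truth_meta.json"]
    # SDLC: _agent variant by default; canonical only when overwrite=True.
    gt = "ground_truth.json" if overwrite else "ground_truth_agent.json"
    return [gt, "ground_truth_meta.json"]
```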
## Daytona Runner Changes (for reference)

The runner was hardened in this session to prevent orphaned sandboxes from accumulating:
1. `cleanup_orphaned_sandboxes()` runs at startup and shutdown
2. `auto_stop_interval=20` (minutes) — sandboxes auto-stop when idle
3. `auto_archive_interval=60` — auto-archive after 1 hour
4. A SIGTERM/SIGINT signal handler cancels futures and triggers cleanup (sketched below)
5. `DEFAULT_PARALLEL=55` (was 20) — matches Tier 3 capacity (250 vCPU / 2 per sandbox = 125 max, minus headroom)
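For reference, the shutdown path follows the standard signal-handler-over-executor pattern; a minimal sketch (only `cleanup_orphaned_sandboxes` and `DEFAULT_PARALLEL` are named above; the rest is assumed):

```python
import signal
import sys
from concurrent.futures import ThreadPoolExecutor

DEFAULT_PARALLEL = 55  # Tier 3: 250 vCPU / 2 vCPU per sandbox = 125 max, minus headroom

def run_all(tasks, run_task, cleanup_orphaned_sandboxes):
    """Illustrative harness: startup sweep, parallel run, cleanup on signal or exit."""
    cleanup_orphaned_sandboxes()  # startup sweep
    executor = ThreadPoolExecutor(max_workers=DEFAULT_PARALLEL)
    futures = [executor.submit(run_task, t) for t in tasks]

    def handle_signal(signum, frame):
        for fut in futures:
            fut.cancel()              # cancel anything not yet started
        executor.shutdown(wait=False)
        cleanup_orphaned_sandboxes()  # shutdown sweep
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, handle_signal)
    signal.signal(signal.SIGINT, handle_signal)

    for fut in futures:
        fut.result()  # propagate worker errors
    executor.shutdown()
    cleanup_orphaned_sandboxes()
```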