Commit 10eeb5b (parent dd4d62e), committed by sjarmak and claude: "Add handoff for re-curation IR analysis workflow" (1 file changed, +101, -0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Handoff: Re-Curation IR Analysis

## Context

We re-curated ground truth for 311/367 benchmark tasks using a calibrated curator agent (Opus 4.6, phase1 prompt, hybrid backend). The new ground-truth files are `_agent` variants that exist alongside the original manually authored files.

**Commit**: `dd4d62eec3` — "Add calibrated curator ground truth (311/367) and harden Daytona sandbox lifecycle"

## What Was Done

- **Org: 207/207 complete** — all tasks have `oracle_answer_agent.json` in `benchmarks/csb_org_*/*/tests/`
- **SDLC: 104/160 complete** — tasks have `ground_truth_agent.json` in `benchmarks/csb_sdlc_*/*/tests/`
- **56 SDLC tasks still missing** — blocked by OAuth rate limits (Accounts 2 and 3 limited until Mar 6 3am UTC; Account 1 available)
- Missing SDLC tasks are concentrated in: `test` (16), `understand` (11), `debug` (10, of which 4 are the known linux `--branch` parse bugs), `secure` (6), `document` (4), `feature` (4), `refactor` (2), `fix` (2), `design` (1)
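
The counts above can be re-checked locally with a quick tally (a sketch; it assumes the `benchmarks/` layout described in this handoff and will report 0 if run elsewhere):

```shell
# Tally curator outputs against the totals above (207 Org, 160 SDLC).
# Path patterns are taken from this handoff; adjust if the layout differs.
org_done=$(find benchmarks -path '*csb_org_*' -name oracle_answer_agent.json 2>/dev/null | wc -l)
sdlc_done=$(find benchmarks -path '*csb_sdlc_*' -name ground_truth_agent.json 2>/dev/null | wc -l)
echo "Org: ${org_done}/207  SDLC: ${sdlc_done}/160"
```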

## Files Modified

- `scripts/daytona_curator_runner.py` — hardened with orphan cleanup, auto-stop, a signal handler, and a default of `parallel=55`
- `benchmarks/csb_org_*/*/tests/oracle_answer_agent.json` — 207 new curator-generated Org oracle files
- `benchmarks/csb_org_*/*/tests/ground_truth.json` — 207 updated (for Org, the curator also writes the canonical file)
- `benchmarks/csb_org_*/*/tests/ground_truth_meta.json` — 207 metadata files
- `benchmarks/csb_sdlc_*/*/tests/ground_truth_agent.json` — 104 new curator-generated SDLC ground-truth files
- `benchmarks/csb_sdlc_*/*/tests/ground_truth_meta.json` — 104 metadata files

## Task 1: Complete Remaining 56 SDLC Tasks

Account 1 is available. Run:

```bash
source .env.local && export HARBOR_ENV=daytona DAYTONA_OVERRIDE_STORAGE=10240 CCB_ACCOUNT=1
python3 scripts/daytona_curator_runner.py \
    --sdlc-all --skip-agent-variants \
    --model claude-opus-4-6 --backend hybrid --prompt-version phase1 \
    --parallel 55
```

After completion, 4 linux kernel tasks will still fail (`linux-acpi-backlight-fault-001`, `linux-hda-intel-suspend-fault-001`, `linux-iwlwifi-subdevice-fault-001`, `linux-nfs-inode-revalidate-fault-001`) — their Dockerfiles use `git clone --branch`, which gets parsed as a repo slug. These need manual ground truth.
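
To see exactly which SDLC tasks are still missing before kicking off the run, a small helper can list them (`missing_sdlc` is a hypothetical name; the path layout is taken from this handoff):

```shell
# List SDLC task dirs that still lack an agent ground-truth file.
missing_sdlc() {
  for d in "$1"/csb_sdlc_*/*/tests; do
    [ -d "$d" ] || continue   # skip an unexpanded glob
    [ -f "$d/ground_truth_agent.json" ] || echo "missing: $d"
  done
}
missing_sdlc benchmarks
```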
## Task 2: Promote Agent Oracles

After all tasks complete, promote the `_agent` variants to canonical:

```bash
python3 scripts/promote_agent_oracles.py --force
```

This replaces `ground_truth.json` / `oracle_answer.json` with the calibrated `_agent` versions.
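
A sanity check after promotion can confirm nothing was skipped (a sketch under the assumption that promotion is a byte-for-byte copy of the `_agent` file; if the promote script normalizes the JSON, compare parsed content instead):

```shell
# Flag any SDLC task where canonical ground truth differs from the _agent variant.
check_promoted() {
  for f in "$1"/csb_sdlc_*/*/tests/ground_truth_agent.json; do
    [ -f "$f" ] || continue
    cmp -s "$f" "$(dirname "$f")/ground_truth.json" || echo "mismatch: $f"
  done
}
check_promoted benchmarks
```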

## Task 3: Re-Run IR Analysis

The IR evaluation pipeline reads:

- SDLC: `ground_truth.json` (so promotion must happen first)
- Org: `oracle_answer.json` first, with `ground_truth.json` as a fallback
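
The Org read order above can be sketched as follows (`read_org_truth` is a hypothetical helper for illustration; the real pipeline code may differ):

```shell
# Org: prefer oracle_answer.json, fall back to ground_truth.json.
read_org_truth() {
  if [ -f "$1/oracle_answer.json" ]; then
    cat "$1/oracle_answer.json"
  elif [ -f "$1/ground_truth.json" ]; then
    cat "$1/ground_truth.json"
  fi
}
```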

After promotion, regenerate the IR analysis:

```bash
# Normalize retrieval events from all official runs
python3 scripts/normalize_retrieval_events.py --runs-dir runs/official/

# Evaluate IR metrics against the new ground truth
python3 scripts/compute_retrieval_metrics.py --runs-dir runs/official/ --output results/ir/

# Generate the V2 report with updated IR numbers
python3 scripts/extract_v2_report_data.py
```

Key metrics to compare before/after promotion:

- Per-suite F1, precision, recall
- Baseline vs SG_full delta (does the MCP advantage change with better ground truth?)
- Overall aggregate F1
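
For spot arithmetic while comparing the before/after numbers, the standard F1 definition (not project-specific code; `f1` is a throwaway helper):

```shell
# F1 = 2*P*R / (P+R); LC_ALL=C keeps the decimal point locale-independent.
f1() {
  LC_ALL=C awk -v p="$1" -v r="$2" \
    'BEGIN { printf("%.3f\n", (p + r == 0 ? 0 : 2 * p * r / (p + r))) }'
}
f1 0.8 0.7   # prints 0.747
```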

## Task 4: Quality Spot-Check (Before Promotion)

Before promoting, spot-check a sample of `_agent` files against the canonical ground truth:

```bash
# Pick 5 random tasks and compare expected-file counts
for f in $(find benchmarks/csb_sdlc_* -name ground_truth_agent.json | shuf | head -5); do
    canonical=$(dirname "$f")/ground_truth.json
    echo "=== $(basename "$(dirname "$(dirname "$f")")") ==="
    echo "Canonical files: $(python3 -c "import json; print(len(json.load(open('$canonical')).get('expected_files', [])))" 2>/dev/null || echo "N/A")"
    echo "Agent files: $(python3 -c "import json; print(len(json.load(open('$f')).get('expected_files', [])))")"
    echo ""
done
```

Look for:

- Agent producing 0 or 1 files (regex rescue, low quality) — should be re-run
- Agent producing 50+ files (over-inclusion) — may need review
- Canonical files the agent missed (recall regression)
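
The size checks above can be automated for a full pass (hedged sketch; `expected_files` is the key used by the spot-check script, and the thresholds mirror the list above — `flag_agent_file` is a hypothetical name):

```shell
# Flag agent ground-truth files outside the sane size range.
flag_agent_file() {
  n=$(python3 -c "import json,sys; print(len(json.load(open(sys.argv[1])).get('expected_files', [])))" "$1")
  if [ "$n" -le 1 ]; then
    echo "LOW ($n): $1"     # likely regex rescue; re-run
  elif [ "$n" -ge 50 ]; then
    echo "HIGH ($n): $1"    # possible over-inclusion; review
  fi
}
```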

## Key Architecture Notes

- `write_curator_outputs()` in `context_retrieval_agent.py` handles file writing for both SDLC and Org
- With `overwrite=False` (the default) it writes `_agent` variants; with `overwrite=True` it writes the canonical files
- `ground_truth_meta.json` records curator metadata: model, backend, prompt version, cost, timestamp
- The curator uses the phase1 prompt (`PHASE1_CLI_PROMPTS` + `PHASE1_SUFFIX`), which is recall-focused (F1=0.749 on the calibration set)
- Hybrid backend = local tools (Bash, Read, Glob, Grep) + Sourcegraph MCP (`sg_keyword_search`, `sg_nls_search`)
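
A quick way to inspect that curator metadata per task (the exact key names are an assumption based on the note above — check one real file first; `meta_summary` is a hypothetical helper):

```shell
# Print assumed metadata fields from a ground_truth_meta.json.
meta_summary() {
  python3 - "$1" <<'PY'
import json, sys
m = json.load(open(sys.argv[1]))
# Key names assumed from the handoff's description (model, backend, prompt version)
print(m.get("model"), m.get("backend"), m.get("prompt_version"))
PY
}
```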

## Daytona Runner Changes (for reference)

The runner was hardened in this session to prevent orphaned-sandbox accumulation:

1. `cleanup_orphaned_sandboxes()` runs at startup and shutdown
2. `auto_stop_interval=20` (minutes) — sandboxes auto-stop when idle
3. `auto_archive_interval=60` — auto-archive after 1 hour
4. A SIGTERM/SIGINT signal handler cancels futures and triggers cleanup
5. `DEFAULT_PARALLEL=55` (was 20) — matches Tier 3 capacity (250 vCPU / 2 per sandbox = 125 max, minus headroom)
