Commit b768d6a

sjarmak and claude committed

docs: add handoff for Daytona-parallelized curator batch run

56 SDLC tasks still need ground_truth.json. This handoff documents the gaps in daytona_curator_runner.py (built for ContextBench, not SDLC) and the architecture for refactoring it to support benchmarks/ directory tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 26f793e, 1 file changed: +181 −0

---
# Handoff: Parallelize Curator Ground Truth Generation via Daytona

## Goal

Refactor `scripts/daytona_curator_runner.py` to support CodeScaleBench SDLC tasks (not just ContextBench), then run the 56 tasks still missing ground truth in parallel via Daytona sandboxes.
## Current State (as of 2026-03-03 16:00 UTC)

### What's done

- Phase1 curator prompt restored in `scripts/context_retrieval_agent.py` (commit 63c9ec401)
  - `PHASE1_CLI_PROMPTS` + `PHASE1_SUFFIX`: per-backend recall-focused prompts
  - `--prompt-version phase1` flag (default), scored F1=0.749 on ContextBench calibration
  - `get_phase1_system_prompt(backend)` and `get_phase1_allowed_tools(backend)` helper functions
- Sequential run completed 18/74 SDLC tasks before being killed (committed in 26f793e95)
- All 221 Org tasks already have oracle files (0 missing)
- `_resolve_repos()` enhanced with Strategy 3: parses `# Repo:` comments from SWEAP Dockerfiles
- `daytona_curator_runner.py` already has the `--prompt-version` flag plumbed through
### What's left: 56 SDLC tasks missing ground_truth.json

- 39 tasks have `git clone` URLs in their Dockerfile (sg-evals mirror repos)
- 6 tasks have a `# Repo:` comment in their Dockerfile (SWEAP images, e.g. element-hq/element-web)
- 9 tasks have `# Source: org/repo (commit)` in their Dockerfile (SWEAP debug tasks)
- 2 tasks use TAC images (ghcr.io/theagentcompany/...) with no repo reference

Breakdown by suite:

```
15 csb_sdlc_test       (code reviews, coverage gaps, unit gen)
10 csb_sdlc_fix        (pytorch, teleport, terraform, webclients)
 9 csb_sdlc_debug      (ansible, flipt, qutebrowser, teleport, tutanota, vuls)
 8 csb_sdlc_secure     (django, flipt)
 4 csb_sdlc_feature    (bustub, postgres, servo, vscode)
 4 csb_sdlc_design     (django, etcd, flipt)
 3 csb_sdlc_understand (django, numpy)
 3 csb_sdlc_refactor   (flipt, python-http)
```
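A quick way to reproduce this count from a checkout, assuming the `benchmarks/csb_sdlc_*/{task_name}/tests/ground_truth.json` layout described in this handoff:

```shell
# List SDLC task dirs that still lack tests/ground_truth.json.
for task in benchmarks/csb_sdlc_*/*/; do
  [ -d "$task" ] || continue
  [ -f "${task}tests/ground_truth.json" ] || echo "MISSING ${task}"
done
```

Piping the output through `wc -l` should report 56 until the batch run completes.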
## The Problem: daytona_curator_runner.py Doesn't Support SDLC Tasks

The current Daytona runner was built for ContextBench validation (it loads tasks from a HuggingFace parquet file and writes trajectory files). To support SDLC ground truth generation, it needs these changes:

### 1. Task Discovery (currently: parquet → need: benchmarks/ directories)

- Replace `load_tasks()` (parquet) with `discover_tasks(sdlc_all=True)` from context_retrieval_agent.py
- Add `--sdlc-all`, `--mcp-all`, `--suite`, `--task-dir` flags (mirror context_retrieval_agent.py's CLI)
- Add a `--missing-only` flag to skip tasks that already have ground_truth.json
- Each task is a `Path` to `benchmarks/csb_sdlc_*/{task_name}/`
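A sketch of the flag surface the refactored runner would need. The flag names come from the bullets above; the argparse wiring itself is illustrative, not the shipped CLI code:

```python
import argparse
from pathlib import Path

# Hypothetical sketch of the runner's new CLI surface, mirroring
# context_retrieval_agent.py's flags as described in this handoff.
parser = argparse.ArgumentParser()
parser.add_argument("--sdlc-all", action="store_true", help="all benchmarks/csb_sdlc_* tasks")
parser.add_argument("--mcp-all", action="store_true")
parser.add_argument("--suite", help="e.g. csb_sdlc_debug")
parser.add_argument("--task-dir", type=Path, help="a single task directory")
parser.add_argument("--missing-only", action="store_true",
                    help="skip tasks that already have tests/ground_truth.json")
parser.add_argument("--prompt-version", default="phase1")

args = parser.parse_args(["--sdlc-all", "--missing-only"])
```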
### 2. Task Context Loading (currently: problem_statement field → need: parse_task_for_curator)

- Call `parse_task_for_curator(task_dir)` for each task to get:
  - `instruction` (from instruction.md), `seed_prompt` (from task_spec.json)
  - `suite_name` (for curator profile selection)
  - `test_sh_diff_targets` (expected edit files from test.sh git diff)
  - `repo_urls` (from Dockerfile git clone commands)
  - `task_type` (sdlc vs mcp_unique)
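For orientation, the context object might look roughly like this. The field names are taken from the bullets above; the dataclass shape (and the name `CuratorTaskContext`) is an assumption, not the actual return type of `parse_task_for_curator()`:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class CuratorTaskContext:
    # Hypothetical shape; field names from the handoff's section 2 bullets.
    task_dir: Path
    instruction: str            # from instruction.md
    seed_prompt: str            # from task_spec.json
    suite_name: str             # drives curator profile selection
    test_sh_diff_targets: list[str] = field(default_factory=list)
    repo_urls: list[str] = field(default_factory=list)
    task_type: str = "sdlc"     # "sdlc" or "mcp_unique"
```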
### 3. Repo Cloning in Sandbox (currently: single repo_url → need: multi-strategy)

- `_resolve_repos(ctx, cache_dir)` currently handles three strategies; a fourth is needed:
  1. Repo fixture (from task_spec.json repo_set_id → fixtures/repo_sets/*.json)
  2. Dockerfile `git clone` URLs (parsed via `_extract_clone_urls()`)
  3. `# Repo:` comment in Dockerfile (SWEAP images)
  4. `# Source: org/repo (commit)` comment in Dockerfile (SWEAP debug tasks — **needs adding**)
- In the Daytona sandbox: clone each repo to `/workspace/{repo_name}/`
- For tasks with sg-evals mirrors: clone `https://github.com/sg-evals/{repo}--{commit}.git`
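The missing Strategy 4 could be sketched like this. The regex and return shape are illustrative, not the shipped `_resolve_repos()` code:

```python
import re

# Parse a "# Source: org/repo (commit)" comment out of a SWEAP debug
# Dockerfile and turn it into a clone URL + pinned commit.
SOURCE_RE = re.compile(
    r"^#\s*Source:\s*(?P<org>[\w.-]+)/(?P<repo>[\w.-]+)\s*\((?P<commit>[0-9a-f]{7,40})\)",
    re.MULTILINE,
)

def parse_source_comment(dockerfile_text):
    m = SOURCE_RE.search(dockerfile_text)
    if not m:
        return None
    url = f"https://github.com/{m['org']}/{m['repo']}.git"
    return url, m["commit"]
```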
### 4. User Message (currently: hardcoded → need: build_user_message)

- Call `build_user_message(ctx, repo_paths)`, which includes:
  - Task description from instruction.md or seed_prompt
  - Repo paths mapped to sandbox locations
  - Verifier targets from test.sh (helps the agent find expected_edit_files)
  - Deep Search repo name hints
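The assembled message might look roughly like the following. This is only an illustration of the shape; the real `build_user_message()` in context_retrieval_agent.py is the source of truth:

```python
# Illustrative sketch of the user-message layout, not the real builder.
def build_user_message_sketch(description, repo_paths, verifier_targets):
    repos = "\n".join(f"- {name}: {path}" for name, path in sorted(repo_paths.items()))
    targets = "\n".join(f"- {t}" for t in verifier_targets) or "- (none)"
    return (
        f"## Task\n{description}\n\n"
        f"## Repositories (sandbox paths)\n{repos}\n\n"
        f"## Verifier targets from test.sh\n{targets}\n"
    )
```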
### 5. Output Writing (currently: trajectory JSON → need: write_curator_outputs)

- Call `write_curator_outputs(task_dir, oracle, metadata, ctx, overwrite=True)` after each task
- This writes to the task's own `tests/` directory:
  - `ground_truth.json` (IR pipeline format — files, symbols, expected_edit_files, chunks)
  - `ground_truth_meta.json` (confidence, cost, timing sidecar)
- **Important**: the Daytona sandbox writes files *inside the sandbox*. You need to either:
  - (A) upload task context to the sandbox, run the curator, download results back to the host, then call `write_curator_outputs` locally, or
  - (B) have the sandbox write to a mounted volume / collect results via `sandbox.process.exec` output
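For option (B), the host has to fish the oracle JSON out of mixed CLI stdout. A simplified stand-in for `_extract_json_from_text()` (the real parser lives in context_retrieval_agent.py):

```python
import json

def extract_oracle_json(stdout_text):
    """Pull the oracle JSON object out of mixed CLI output.

    Tries each '{' as a candidate start and returns the first parse
    that looks like an oracle payload (has a "files" key).
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(stdout_text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(stdout_text, i)
        except ValueError:
            continue
        if isinstance(obj, dict) and "files" in obj:
            return obj
    return None
```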
### 6. Recommended Architecture

The simplest approach: **keep repo cloning and curator execution in Daytona, but collect results back to the host for writing**.

```python
def process_sdlc_task(task_dir, idx, total, daytona_client, creds, model, backend, prompt_version):
    # 1. Parse task locally
    ctx = parse_task_for_curator(task_dir)

    # 2. Determine repo URL(s) for sandbox cloning
    repo_info = _extract_repo_info_for_sandbox(ctx)  # NEW: extract URL+commit for sandbox cloning

    # 3. Create sandbox, clone repo(s), install tools
    sandbox = setup_curator_sandbox(daytona_client, creds, repo_info, model, backend)
    try:
        # 4. Build user message with sandbox repo paths
        sandbox_repo_paths = {name: Path(f"/workspace/{name}") for name in repo_info}
        user_msg = build_user_message(ctx, sandbox_repo_paths)

        # 5. Run curator in sandbox
        result = run_curator_in_sandbox(sandbox, user_msg, creds, model, backend, prompt_version, ctx)

        # 6. Write results locally (not in sandbox)
        if result["oracle"].get("files"):
            write_curator_outputs(task_dir, result["oracle"], result["metadata"], ctx, overwrite=True)
        return result
    finally:
        # 7. Cleanup sandbox (even if the curator run fails)
        daytona_client.delete(sandbox)
```
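Driving that per-task function at 20 concurrent sandboxes can be sketched with stdlib `concurrent.futures`; `run_all` and the one-argument `process_fn` (e.g. `process_sdlc_task` with the shared arguments partially applied via `functools.partial`) are illustrative names:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(task_dirs, process_fn, max_workers=20):
    """Run process_fn over task dirs with bounded parallelism."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, d): d for d in task_dirs}
        for fut in as_completed(futures):
            task_dir = futures[fut]
            try:
                results[task_dir] = fut.result()
            except Exception as exc:  # one failed task shouldn't kill the batch
                results[task_dir] = {"error": str(exc)}
    return results
```

Threads (not processes) are fine here because each worker spends its time waiting on sandbox I/O, not on CPU.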
## Key Files to Read

```
scripts/context_retrieval_agent.py   # The full curator — prompts, tools, CLI runner, output writing
  - PHASE1_CLI_PROMPTS (line ~870)   # The phase1 prompt constants
  - get_phase1_system_prompt()       # Helper to get the prompt for a backend
  - get_phase1_allowed_tools()       # Helper to get the tools for a backend
  - parse_task_for_curator()         # Phase 0 task parsing
  - build_user_message()             # User message construction
  - _resolve_repos()                 # Multi-strategy repo resolution
  - write_curator_outputs()          # Ground truth file writing
  - _convert_to_ir_schema()          # Oracle → ground_truth.json conversion
  - _extract_json_from_text()        # Parse oracle JSON from CLI output

scripts/daytona_curator_runner.py    # Current Daytona runner (ContextBench-only)
  - setup_curator_sandbox()          # Sandbox creation + tool installation
  - _build_python_runner()           # Python wrapper avoiding shell quoting
  - run_curator_in_sandbox()         # Prompt injection + CLI execution
  - process_task()                   # Per-task orchestration

docs/DAYTONA.md                      # Daytona environment reference
```
## Environment Setup

```bash
source .env.local   # Sets DAYTONA_API_KEY, SOURCEGRAPH_ACCESS_TOKEN, OAuth creds

# Required env vars:
#   DAYTONA_API_KEY           — Daytona API key (Tier 3: 125 sandbox limit)
#   SOURCEGRAPH_ACCESS_TOKEN  — For SG keyword + NLS search in the hybrid backend
#   OAuth creds at ~/.claude-homes/account1/.claude/.credentials.json

# Verify:
python3 -c "import daytona_sdk; print('OK')"
```
## Execution Parameters

- **Parallelism**: 20 concurrent sandboxes (conservative for long-running curator tasks)
- **Model**: claude-opus-4-6 (subscription billing via OAuth, not API key)
- **Backend**: hybrid (local tools + SG keyword + SG NLS search)
- **Prompt**: phase1 (default, recall-focused)
- **Timeout**: 900s per task (15 min, same as the sequential run)
- **Expected cost**: ~$0.66/task × 56 ≈ $37 (subscription, effectively $0)
- **Expected time**: ~30 min with 20 parallel workers (vs ~4 hours sequential)
## Verification After Run

```bash
# Check coverage
python3 scripts/context_retrieval_agent.py --sdlc-all --missing-only --dry-run
# Should show: Total: 0 tasks

# Spot-check a few outputs
python3 -m json.tool benchmarks/csb_sdlc_debug/flipt-auth-cookie-regression-prove-001/tests/ground_truth.json | head -20

# Commit results (note the extra */ for the task-dir level)
git add benchmarks/csb_sdlc_*/*/tests/ground_truth*.json
git commit -m "chore: commit ground truth from Daytona parallel curator run"
git push
```
## Edge Cases

1. **TAC image tasks** (bustub, openhands): no repo URL in the Dockerfile, so these need manual repo resolution or should be skipped. The bustub task uses `ghcr.io/theagentcompany/sde-implement-hyperloglog-image:1.0.0` — the repo is likely `cmu-db/bustub`. The openhands task needs `All-Hands-AI/OpenHands`.

2. **SWEAP debug tasks with a `# Source:` comment**: 9 tasks have `# Source: org/repo (commit)` but no `# Repo:` line. `_resolve_repos()` Strategy 3 only parses `# Repo:`, so you need to add a Strategy 4 that parses `# Source: org/repo (commit)` and clones `https://github.com/org/repo.git` at that commit.

3. **Large repos** (pytorch, ansible, teleport): these take 5-10 min to clone. Consider pre-warming the Daytona sandbox image with common repos, or cloning locally first and uploading.

4. **Rate limiting**: with 20 concurrent `claude -p` calls, monitor for rate-limit errors. The sequential run had 0 errors across 18 tasks at ~$0.66/task average.
