# Handoff: Parallelize Curator Ground Truth Generation via Daytona

## Goal
Refactor `scripts/daytona_curator_runner.py` to support CodeScaleBench SDLC tasks (not just ContextBench), then run the 56 remaining missing ground truth tasks in parallel via Daytona sandboxes.

## Current State (as of 2026-03-03 16:00 UTC)

### What's done
- Phase1 curator prompt restored in `scripts/context_retrieval_agent.py` (commit 63c9ec401)
  - `PHASE1_CLI_PROMPTS` + `PHASE1_SUFFIX`: per-backend recall-focused prompts
  - `--prompt-version phase1` flag (default), scored F1=0.749 on ContextBench calibration
  - `get_phase1_system_prompt(backend)` and `get_phase1_allowed_tools(backend)` helper functions
- Sequential run completed 18/74 SDLC tasks before being killed (committed in 26f793e95)
- All 221 Org tasks already have oracle files (0 missing)
- `_resolve_repos()` enhanced with Strategy 3: parses `# Repo:` comments from SWEAP Dockerfiles
- `daytona_curator_runner.py` already has `--prompt-version` flag plumbed through
### What's left: 56 SDLC tasks missing ground_truth.json
- 39 tasks have `git clone` URLs in Dockerfile (sg-evals mirror repos)
- 6 tasks have `# Repo:` comment in Dockerfile (SWEAP images, e.g. element-hq/element-web)
- 9 tasks have `# Source: org/repo (commit)` in Dockerfile (SWEAP debug tasks)
- 2 tasks use TAC images (ghcr.io/theagentcompany/...) with no repo reference

Breakdown by suite:
```
15 csb_sdlc_test (code reviews, coverage gaps, unit gen)
10 csb_sdlc_fix (pytorch, teleport, terraform, webclients)
 9 csb_sdlc_debug (ansible, flipt, qutebrowser, teleport, tutanota, vuls)
 8 csb_sdlc_secure (django, flipt)
 4 csb_sdlc_feature (bustub, postgres, servo, vscode)
 4 csb_sdlc_design (django, etcd, flipt)
 3 csb_sdlc_understand (django, numpy)
 3 csb_sdlc_refactor (flipt, python-http)
```
## The Problem: daytona_curator_runner.py Doesn't Support SDLC Tasks

The current Daytona runner was built for ContextBench validation (it loads tasks from HuggingFace parquet and writes trajectory files). To support SDLC ground truth generation, it needs these changes:

### 1. Task Discovery (currently: parquet → need: benchmarks/ directories)
- Replace `load_tasks()` (parquet) with `discover_tasks(sdlc_all=True)` from context_retrieval_agent.py
- Add `--sdlc-all`, `--mcp-all`, `--suite`, `--task-dir` flags (mirror context_retrieval_agent.py's CLI)
- Add a `--missing-only` flag to skip tasks that already have ground_truth.json
- Each task is a `Path` to `benchmarks/csb_sdlc_*/{task_name}/`
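The `--missing-only` filter can be a straightforward directory scan. A minimal sketch, assuming the `benchmarks/csb_sdlc_*/{task_name}/tests/ground_truth.json` layout described above; the helper name `find_missing_tasks` is illustrative, not from the codebase:

```python
from pathlib import Path

def find_missing_tasks(root: Path) -> list[Path]:
    """Return SDLC task dirs under root that lack tests/ground_truth.json.

    Illustrative sketch for --missing-only; assumes the
    benchmarks/csb_sdlc_*/{task_name}/ layout described in this doc.
    """
    missing = []
    for task_dir in sorted(root.glob("csb_sdlc_*/*")):
        if task_dir.is_dir() and not (task_dir / "tests" / "ground_truth.json").exists():
            missing.append(task_dir)
    return missing
```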
### 2. Task Context Loading (currently: problem_statement field → need: parse_task_for_curator)
- Call `parse_task_for_curator(task_dir)` for each task to get:
  - `instruction` (from instruction.md), `seed_prompt` (from task_spec.json)
  - `suite_name` (for curator profile selection)
  - `test_sh_diff_targets` (expected edit files from test.sh git diff)
  - `repo_urls` (from Dockerfile git clone commands)
  - `task_type` (sdlc vs mcp_unique)
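For reference, the fields above could be carried in a small container like the following. This is only an illustrative shape; the actual return structure of `parse_task_for_curator()` in context_retrieval_agent.py may differ:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class CuratorTaskContext:
    """Illustrative container for the parsed task fields listed above.

    The real parse_task_for_curator() return shape may differ; this is
    only a sketch of what the Daytona runner needs to carry per task.
    """
    task_dir: Path
    instruction: str = ""
    seed_prompt: str = ""
    suite_name: str = ""
    task_type: str = "sdlc"  # "sdlc" or "mcp_unique"
    test_sh_diff_targets: list[str] = field(default_factory=list)
    repo_urls: list[str] = field(default_factory=list)
```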
### 3. Repo Cloning in Sandbox (currently: single repo_url → need: multi-strategy)
- `_resolve_repos(ctx, cache_dir)` currently handles three strategies; a fourth is needed:
  1. Repo fixture (from task_spec.json repo_set_id → fixtures/repo_sets/*.json)
  2. Dockerfile git clone URLs (parsed via `_extract_clone_urls()`)
  3. `# Repo:` comment in Dockerfile (SWEAP images)
  4. `# Source: org/repo (commit)` comment in Dockerfile (SWEAP debug tasks — **needs adding**)
- In the Daytona sandbox: clone each repo to /workspace/{repo_name}/
- For tasks with sg-evals mirrors: clone `https://github.com/sg-evals/{repo}--{commit}.git`
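A minimal sketch of the URL handling above. `extract_clone_urls` is a simplified stand-in for the repo's `_extract_clone_urls()` (it only looks for an `https://...` token on a `git clone` line), and the mirror URL builder just applies the `{repo}--{commit}` naming convention noted above:

```python
import re

def extract_clone_urls(dockerfile_text: str) -> list[str]:
    """Pull https URLs out of `git clone ...` lines in a Dockerfile.

    Simplified stand-in for _extract_clone_urls(); the real helper
    may handle flags, ssh URLs, and line continuations.
    """
    urls = []
    for line in dockerfile_text.splitlines():
        if "git clone" in line:
            m = re.search(r"https://\S+", line)
            if m:
                urls.append(m.group(0).rstrip("'\""))
    return urls

def sg_evals_mirror_url(repo: str, commit: str) -> str:
    """Clone URL for an sg-evals mirror, per the {repo}--{commit} convention."""
    return f"https://github.com/sg-evals/{repo}--{commit}.git"
```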
### 4. User Message (currently: hardcoded → need: build_user_message)
- Call `build_user_message(ctx, repo_paths)` which includes:
  - Task description from instruction.md or seed_prompt
  - Repo paths mapped to sandbox locations
  - Verifier targets from test.sh (helps agent find expected_edit_files)
  - Deep Search repo name hints
### 5. Output Writing (currently: trajectory JSON → need: write_curator_outputs)
- Call `write_curator_outputs(task_dir, oracle, metadata, ctx, overwrite=True)` after each task
- This writes to the task's own `tests/` directory:
  - `ground_truth.json` (IR pipeline format — files, symbols, expected_edit_files, chunks)
  - `ground_truth_meta.json` (confidence, cost, timing sidecar)
- **Important**: The Daytona sandbox writes files *inside the sandbox*. You need to either:
  - (A) Upload the task context to the sandbox, run the curator, download results back to the host, then call `write_curator_outputs` locally, or
  - (B) Have the sandbox write to a mounted volume / collect results via `sandbox.process.exec` output
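Option (B) means fishing the oracle JSON out of mixed CLI stdout. A simplified sketch of what `_extract_json_from_text()` has to do; the real helper in context_retrieval_agent.py may be more robust (e.g. braces inside string literals would defeat this brace-counting approach):

```python
import json

def extract_last_json_object(text: str):
    """Find and parse the last top-level {...} object in mixed CLI output.

    Simplified sketch of _extract_json_from_text(); scans backwards from
    each closing brace, counting nesting, and tries json.loads on the
    first balanced candidate.
    """
    end = text.rfind("}")
    while end != -1:
        depth = 0
        for start in range(end, -1, -1):
            if text[start] == "}":
                depth += 1
            elif text[start] == "{":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:end + 1])
                    except json.JSONDecodeError:
                        break
        end = text.rfind("}", 0, end)
    return None
```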
### 6. Recommended Architecture

The simplest approach: **keep repo cloning and curator execution in Daytona, but collect results back to the host for writing**.

```python
def process_sdlc_task(task_dir, idx, total, daytona_client, creds, model, backend, prompt_version):
    # 1. Parse task locally
    ctx = parse_task_for_curator(task_dir)

    # 2. Determine repo URL(s) for sandbox cloning
    repo_info = _extract_repo_info_for_sandbox(ctx)  # NEW: extract URL+commit for sandbox cloning

    # 3. Create sandbox, clone repo(s), install tools
    sandbox = setup_curator_sandbox(daytona_client, creds, repo_info, model, backend)

    # 4. Build user message with sandbox repo paths
    sandbox_repo_paths = {name: Path(f"/workspace/{name}") for name in repo_info}
    user_msg = build_user_message(ctx, sandbox_repo_paths)

    # 5. Run curator in sandbox
    result = run_curator_in_sandbox(sandbox, user_msg, creds, model, backend, prompt_version, ctx)

    # 6. Write results locally (not in sandbox)
    if result["oracle"].get("files"):
        write_curator_outputs(task_dir, result["oracle"], result["metadata"], ctx, overwrite=True)

    # 7. Cleanup sandbox
    daytona_client.delete(sandbox)
    return result
```
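The per-task function above can then be fanned out over a thread pool at the chosen parallelism, collecting results and errors separately so one failed sandbox doesn't abort the whole run. A generic sketch (the `worker` argument would be a closure over `process_sdlc_task` and its fixed parameters):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_parallel(tasks, worker, max_workers=20):
    """Fan a per-task worker out over a thread pool.

    Results and exceptions are collected per task so a single sandbox
    failure is recorded rather than killing the run. Generic sketch;
    process_sdlc_task() above would be wrapped and passed as `worker`.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, t): t for t in tasks}
        for fut in as_completed(futures):
            task = futures[fut]
            try:
                results[task] = fut.result()
            except Exception as exc:  # record and continue
                errors[task] = exc
    return results, errors
```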
## Key Files to Read

```
scripts/context_retrieval_agent.py   # The full curator — prompts, tools, CLI runner, output writing
  - PHASE1_CLI_PROMPTS (line ~870)   # The phase1 prompt constants
  - get_phase1_system_prompt()       # Helper to get prompt for backend
  - get_phase1_allowed_tools()       # Helper to get tools for backend
  - parse_task_for_curator()         # Phase 0 task parsing
  - build_user_message()             # User message construction
  - _resolve_repos()                 # Multi-strategy repo resolution
  - write_curator_outputs()          # Ground truth file writing
  - _convert_to_ir_schema()          # Oracle → ground_truth.json conversion
  - _extract_json_from_text()        # Parse oracle JSON from CLI output

scripts/daytona_curator_runner.py    # Current Daytona runner (ContextBench-only)
  - setup_curator_sandbox()          # Sandbox creation + tool installation
  - _build_python_runner()           # Python wrapper avoiding shell quoting
  - run_curator_in_sandbox()         # Prompt injection + CLI execution
  - process_task()                   # Per-task orchestration

docs/DAYTONA.md                      # Daytona environment reference
```
## Environment Setup

```bash
source .env.local  # Sets DAYTONA_API_KEY, SOURCEGRAPH_ACCESS_TOKEN, OAuth creds

# Required env vars:
#   DAYTONA_API_KEY          — Daytona API key (Tier 3: 125 sandbox limit)
#   SOURCEGRAPH_ACCESS_TOKEN — For SG keyword + NLS search in hybrid backend
#   OAuth creds at ~/.claude-homes/account1/.claude/.credentials.json

# Verify:
python3 -c "import daytona_sdk; print('OK')"
```
## Execution Parameters

- **Parallelism**: 20 concurrent sandboxes (conservative for long-running curator tasks)
- **Model**: claude-opus-4-6 (subscription billing via OAuth, not API key)
- **Backend**: hybrid (local tools + SG keyword + SG NLS search)
- **Prompt**: phase1 (default, recall-focused)
- **Timeout**: 900s per task (15 min, same as sequential)
- **Expected cost**: ~$0.66/task × 56 = ~$37 (subscription, effectively $0)
- **Expected time**: ~30 min with 20 parallel (vs ~4 hours sequential)
## Verification After Run

```bash
# Check coverage
python3 scripts/context_retrieval_agent.py --sdlc-all --missing-only --dry-run
# Should show: Total: 0 tasks

# Spot-check a few outputs
cat benchmarks/csb_sdlc_debug/flipt-auth-cookie-regression-prove-001/tests/ground_truth.json | python3 -m json.tool | head -20

# Commit results (note the per-task directory level in the glob)
git add benchmarks/csb_sdlc_*/*/tests/ground_truth*.json
git commit -m "chore: commit ground truth from Daytona parallel curator run"
git push
```
## Edge Cases

1. **TAC image tasks** (bustub, openhands): No repo URL in Dockerfile. Need manual repo resolution or skip. The bustub task uses `ghcr.io/theagentcompany/sde-implement-hyperloglog-image:1.0.0` — the repo is likely `cmu-db/bustub`. The openhands task needs `All-Hands-AI/OpenHands`.

2. **SWEAP debug tasks with `# Source:` comment**: 9 tasks have `# Source: org/repo (commit)` but no `# Repo:` line. The `_resolve_repos()` Strategy 3 only parses `# Repo:`. You need to add a Strategy 4 that parses `# Source: org/repo (commit)` and clones `https://github.com/org/repo.git` at that commit.
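A sketch of that Strategy 4 parser. The regex assumes the comment format shown above (`# Source: org/repo (commit)` with a hex SHA); verify it against the actual SWEAP Dockerfiles before relying on it:

```python
import re

# Matches a `# Source: org/repo (commit)` Dockerfile comment.
# Assumption: commit is a 7-40 char lowercase hex SHA, per the format above.
SOURCE_RE = re.compile(r"^#\s*Source:\s*(\S+/\S+?)\s*\(([0-9a-f]{7,40})\)", re.MULTILINE)

def parse_source_comment(dockerfile_text: str):
    """Strategy 4 sketch: return (clone_url, commit) from a # Source: comment,
    or None if the Dockerfile has no such comment."""
    m = SOURCE_RE.search(dockerfile_text)
    if not m:
        return None
    org_repo, commit = m.group(1), m.group(2)
    return f"https://github.com/{org_repo}.git", commit
```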
3. **Large repos** (pytorch, ansible, teleport): These take 5-10 min to clone. Consider pre-warming the Daytona sandbox image with common repos, or cloning locally first and uploading.

4. **Rate limiting**: With 20 concurrent `claude -p` calls, monitor for rate limit errors. The sequential run had 0 errors in 18 tasks at ~$0.66/task avg.
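If rate limits do show up at this parallelism, a retry wrapper with exponential backoff and jitter is the usual mitigation. A generic sketch; which exception (or exit code) actually signals a rate limit from `claude -p` is an assumption to verify against real output — `RuntimeError` here is just a placeholder:

```python
import random
import time

def with_backoff(fn, retries=4, base_delay=2.0):
    """Retry fn() with exponential backoff plus jitter.

    Sketch only: RuntimeError stands in for whatever error actually
    signals a rate limit from the curator CLI call.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```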