bd: backup 2026-03-11 23:15

sjarmak · sjarmak · commit cda9a5e4b53a · 2026-03-11T23:15:55.000Z
diff --git a/.beads/backup/backup_state.json b/.beads/backup/backup_state.json
@@ -1,10 +1,10 @@
 {
-  "last_dolt_commit": "7ng672sb3g64kpgthqdlq88jv9bap933",
+  "last_dolt_commit": "0shcgmgi71gvn6odg2qh96i3pemcfk0p",
   "last_event_id": 0,
-  "timestamp": "2026-03-11T22:53:30.363463345Z",
+  "timestamp": "2026-03-11T23:15:55.239714718Z",
   "counts": {
-    "issues": 27,
-    "events": 84,
+    "issues": 28,
+    "events": 85,
     "comments": 0,
     "dependencies": 16,
     "labels": 0,
diff --git a/.beads/backup/events.jsonl b/.beads/backup/events.jsonl
@@ -82,3 +82,4 @@
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:48:02Z","event_type":"status_changed","id":82,"issue_id":"CodeScaleBench-6or","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-6or\",\"title\":\"Dual-score reporting: paired stats and breakdowns for both dimensions\",\"description\":\"Extend reporting to show both score dimensions: (1) compute_paired_stats produces bl_reward_direct, mcp_reward_direct, delta_direct (and same for artifact); (2) breakdown_by generates per-language, per-difficulty, per-suite stats for each dimension; (3) Add correlation analysis between direct and artifact scores (do agents that edit well also describe well?). Output unified report with both dimensions.\",\"status\":\"open\",\"priority\":3,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:18:27Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:18:27Z\"}"}
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:49:07Z","event_type":"closed","id":83,"issue_id":"CodeScaleBench-6or","new_value":"Added DUAL-SCORE ANALYSIS and DUAL-SCORE BY SUITE sections to extract_v2_report_data.py output. Shows direct vs artifact means, gap, and Pearson correlation. breakdown_by() now includes per-dimension stats (bl_mean_direct, mcp_mean_direct, delta_direct, etc.) when data available.","old_value":""}
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:49:07Z","event_type":"closed","id":84,"issue_id":"CodeScaleBench-zrs","new_value":"Epic complete. 275 tasks in benchmarks/csb/ across 9 merged suites, all with dual-score verifiers. Agent instructions updated to always produce both direct edits and answer.json. Extraction and reporting pipelines extended for dual scores.","old_value":""}
+{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T23:15:55Z","event_type":"created","id":85,"issue_id":"CodeScaleBench-wn8","new_value":"","old_value":""}
diff --git a/.beads/backup/issues.jsonl b/.beads/backup/issues.jsonl
@@ -22,6 +22,7 @@
 {"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"Fixed: OpenHands [core] TOML config + no-changes guard on 317 verifier files","closed_at":"2026-03-09T22:16:44Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"66f93b62babd84105ef88d3725711a03f45954d04971ac834029de6b929415cd","created_at":"2026-03-09T21:53:24Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Two intertwined issues discovered during OpenHands verification batch (runs/staging/openhands_sonnet46_20260309_210054):\n\n## Issue 1: OpenHands LocalRuntime crashes on Daytona (ALL tasks)\n\nEvery task (17/18 completed) crashes with:\n```\ntenacity.RetryError in openhands/runtime/impl/local/local_runtime.py:393 _wait_until_alive\n```\nOpenHands v1.4.0 LocalRuntime tries to start jupyter-kernelgateway + action execution server on localhost. It fails to bind/connect inside Daytona sandboxes. The agent never executes any actions.\n\nPrevious successful OpenHands runs (686 results in staging) must have used a different config or environment. Need to determine what changed.\n\n## Issue 2: Verifiers produce false-positive scores when agent makes no changes\n\nelement-web-roomheaderbuttons-can-crash-fix-001 MCP scored 1.0 even though the agent crashed and made ZERO code changes. The verifier ran tests against the unmodified repo and some passed. This is a contract violation — verifiers must detect \"no agent output\" and score 0.0 before running tests.\n\nSimilarly, django-rate-limit-design-001 scored 0.05 on both configs despite the agent never running.\n\nTasks affected: all test_ratio and repo_state_heuristic verifiers that don't have a guard check for \"did the agent actually produce output.\"","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-ki9","is_template":0,"issue_type":"bug","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":1,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"Fix OpenHands runtime crash on Daytona + investigate false-positive verifiers","updated_at":"2026-03-09T22:16:44Z","waiters":"","wisp_type":"","work_type":""}
 {"acceptance_criteria":"The readiness path for the harnesses actually in scope is documented and verified; SOURCEGRAPH_ACCESS_TOKEN is confirmed to load from .env.local for operator shells or launcher wrappers; Gemini is explicitly excluded from the immediate rerun gate; the exact commands to gate and launch the pending reruns are recorded in the issue notes or description.","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"All harnesses pass readiness checks. SG token confirmed from .env.local. Gemini excluded from immediate gate.","closed_at":"2026-03-09T20:23:25Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"7502ea94c9d230e14d139c67ca911010befb98b8cf4caa83a9a9a9710d47d945","created_at":"2026-03-09T20:19:06Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Track the operational gating work needed before treating the pending reruns as ready for harness-agnostic CI or launch checks.\n\nScope:\n- Treat active harnesses separately from the full registry-wide check when Gemini is not in scope.\n- Confirm SOURCEGRAPH_ACCESS_TOKEN is sourced from .env.local (or equivalent launcher path) before running readiness checks.\n- Validate the relevant readiness commands for the immediate rerun work, such as:\n  - python3 scripts/check_harness_readiness.py --harness codex --format json\n  - equivalent checks for other active harnesses as needed\n- Confirm the previously failed rerun workflow can be gated without requiring unrelated harness credentials.\n- Document any remaining blocker as either env setup, launcher bug, or harness-specific requirement.\n\nThis is separate from task-contract migration work and separate from the historical rerun execution/classification task already tracked in Beads.\n","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-rm3","is_template":0,"issue_type":"task","last_activity":null,"metadata":"{}","mol_type":"","notes":"Verified: all harnesses pass readiness (codex, cursor, gemini, copilot, openhands). SG token loads from .env.local (61 chars). Gemini passes but is out of scope for immediate reruns.","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":1,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"Validate active-harness CI gating before pending rerun batches","updated_at":"2026-03-09T20:23:25Z","waiters":"","wisp_type":"","work_type":""}
 {"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"Built unified 280-task manifest (schema v2.0). comprehension=100, implementation=90, quality=90. Overall power=84.1% at sigma=0.20. Large codebase 58.6%, multi-repo 31.8%, 20 suites, 11 languages. LOC fallback chain eliminates all unknowns.","closed_at":"2026-03-07T23:33:05Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"e464c7d5aa11f02b2eac40dc12bfbee707add98b6882dc3f11c7d9410edd7b71","created_at":"2026-03-07T22:56:46Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Rebuild the core benchmark manifest as a single unified set (no SDLC vs Org split). Optimize selection for: (1) 80% power for overall retrieval effect, (2) balanced task-type representation (comprehension/implementation/quality), (3) multi-repo coverage in every task type, (4) LOC band diversity with emphasis on large codebases (2M+ LOC). Target ~280-300 tasks based on power analysis. Every task has both deterministic reward and IR retrieval scoring.","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-utv","is_template":0,"issue_type":"task","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":3,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"Rebuild unified manifest with power-optimized task-type balance","updated_at":"2026-03-07T23:33:05Z","waiters":"","wisp_type":"","work_type":""}
+{"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"","closed_at":null,"closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"789ee3d54a8665f441c400277592aca27c161525bd1eeb4610b2e285df749e5e","created_at":"2026-03-11T23:15:55Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"96 Dockerfiles (24 tasks x 4 variants) reference jefzda/sweap-images on personal Docker Hub. These fail in cloud environments without Docker Hub credentials. Migrate all to ghcr.io/sg-evals/sweap-images. Affected suites: csb_sdlc_debug (ansible, qutebrowser, teleport, vuls, flipt), csb_sdlc_fix (ansible, nodebb, element-web).","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-wn8","is_template":0,"issue_type":"bug","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":1,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"open","target":"","timeout_ns":0,"title":"Migrate 24 SWEAP tasks from jefzda/ Docker Hub to ghcr.io/sg-evals/","updated_at":"2026-03-11T23:15:55Z","waiters":"","wisp_type":"","work_type":""}
 {"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"Epic complete: (1) IR scoring added to SDLC tasks (ggy), (2) 67 Org tasks got deterministic verifiers (c17), (3) unified 280-task manifest built (utv). No more SDLC/Org split.","closed_at":"2026-03-07T23:33:07Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"e3d9bf86e6f520ab604c0c7d317b708e8814f4e5505b5d360caf4591b3428e2d","created_at":"2026-03-07T22:56:15Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Converge the two halves of CodeScaleBench (SDLC with deterministic verifiers + Org with answer.json verifiers) into a single unified benchmark. Three phases: (1) add IR scoring to SDLC tasks via curator ground truth, (2) promote select Org tasks to SDLC categories with deterministic verifiers, (3) rebuild manifest optimized for multi-repo, large codebase, and task-type balance (comprehension/implementation/quality).","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-xjg","is_template":0,"issue_type":"feature","last_activity":null,"metadata":"{}","mol_type":"","notes":"","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":1,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"[Epic] Unify SDLC + Org into single balanced benchmark","updated_at":"2026-03-07T23:33:07Z","waiters":"","wisp_type":"","work_type":""}
 {"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"Investigated and fixed 3 OH infrastructure bugs:\n1. pkill FileNotFoundError — guard with shutil.which(), fallback to os.system()\n2. agent_skills plugin timeout — stripped all sandbox_plugins (jupyter + agent_skills)\n3. chown -R /workspace timeout — patched installed runtime_init.py source to replace chown with no-op\n\nAlso: removed bustub-hyperloglog-impl-001 from active selection (TAC infra incompatible), fixed $DEVICE_NAME in teleport instruction.\n\nSmoke test (3 tasks paired on Daytona) passes: all baselines and MCP configs produce real scores. Ready for 12-task rerun.","closed_at":"2026-03-10T17:22:52Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"ac89868978b54a6008a99a151b8f278d8fdc393d23b13578f18cb1bd62db75e7","created_at":"2026-03-10T11:27:18Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Three distinct infra failures need fixing before rerunning OH verification tasks:\n\n1. Harbor FileNotFoundError: django-select-for-update agent ran successfully (614 lines output, 0 crashes) but Harbor crashed writing command-2/return-code.txt. Likely Daytona sandbox cleanup race in ccb_harbor.daytona:GuardedDaytonaEnvironment.\n\n2. DinD build failure: bustub-hyperloglog baseline (Claude Haiku sentinel, csb_sdlc_feature_haiku_20260309_223654) — DinD build never completed, no task-level result dir created.\n\n3. MCP 6.5hr exception: bustub-hyperloglog MCP (same sentinel run) — ran 6.5 hours then exception_raised. flagged.json shows deepsearch_unused + only 7.86% MCP ratio.\n\nAfter fixing these, rerun all 12 tasks using configs/oh_full_rerun_20260310.json. The 9 original verification subset tasks crashed due to jupyter/fget bugs (now fixed in d0fab95). The 3 extra tasks (compliance-124, agentic-122, django-select-for-update) also need rerun. Note: 3 tasks are csb_org_* — verify OH launcher handles org tasks (prior rerun silently skipped them).\n\nAlso audit official runs for false positives from the no_changes_guard verifier bug (fixed in c5f261f):\n  grep -rl no_changes_guard runs/official/*/validation_result.json\n\nTainted runs (do NOT promote): openhands_sonnet46_20260309_{210054,223658,232947}","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-yb4","is_template":0,"issue_type":"bug","last_activity":null,"metadata":"{}","mol_type":"","notes":"## Investigation Results (2026-03-10)\n\n### Issue 1: Harbor FileNotFoundError (django-select-for-update)\n**Root cause**: NOT a Harbor/Daytona sandbox race. The actual error is `FileNotFoundError: [Errno 2] No such file or directory: 'pkill'` in `/tmp/oh_launcher.py` line 262. Some container images don't have `pkill` installed.\n**Fix**: Added `shutil.which('pkill')` guard in `agent.py` — falls back to `os.system('kill $(ps aux | ...)')` when pkill is unavailable.\n\n### Issue 2: Jupyter fget crash (AttributeError: 'list' object has no attribute 'fget')\n**Status**: Already fixed in d0fab95. Current code on main correctly uses list comprehension to filter sandbox_plugins.\n\n### Issue 3: Bustub-hyperloglog MCP 6.5hr timeout\n**Root cause**: AgentTimeoutError after hitting 24000s max. Haiku sentinel run with only 7.86% MCP usage. Task-level/model issue, not infra bug. No code fix needed.\n\n### Issue 4: Bustub-hyperloglog DinD build failure\n**Status**: Haiku sentinel run — DinD build never completed. Likely transient. Will be retried in rerun.\n\n### no_changes_guard audit\n**Result**: No `no_changes_guard` references found in any official run result files. No false-positive contamination.\n\n### OH launcher org task support\n**Verified**: `openhands_2config.sh` reads task_dir/benchmark from JSON directly. No filtering that skips csb_org_* tasks. The 3 org tasks in oh_full_rerun_20260310.json will work.\n\n### Remaining\n- The pkill fix needs commit+push\n- Then rerun all 12 tasks via: `--subset oh_full_rerun_20260310.json`\n- Tainted staging runs (openhands_sonnet46_20260309_{210054,223658,232133,232947,233609}) must NOT be promoted","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":2,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"Investigate OH/Harbor infrastructure failures before rerun","updated_at":"2026-03-10T17:22:52Z","waiters":"","wisp_type":"","work_type":""}
 {"acceptance_criteria":"","actor":"","agent_state":"","assignee":null,"await_id":"","await_type":"","close_reason":"Epic complete. 275 tasks in benchmarks/csb/ across 9 merged suites, all with dual-score verifiers. Agent instructions updated to always produce both direct edits and answer.json. Extraction and reporting pipelines extended for dual scores.","closed_at":"2026-03-11T01:49:08Z","closed_by_session":"","compacted_at":null,"compacted_at_commit":null,"compaction_level":0,"content_hash":"e154dfef03983355cc3f5e6b40c4f0d1706e4e863944471b34db80c692fbb78a","created_at":"2026-03-11T01:15:58Z","created_by":"sjarmak","crystallizes":0,"defer_until":null,"description":"Epic: Every task run yields two independent scores (reward_direct from file edits, reward_artifact from answer.json). No mode switching — agent always does both. Requires changes to agent instructions, verifier infrastructure, result extraction, and all 275 task verifiers.","design":"","due_at":null,"ephemeral":0,"estimated_minutes":null,"event_kind":"","external_ref":null,"hook_bead":"","id":"CodeScaleBench-zrs","is_template":0,"issue_type":"feature","last_activity":null,"metadata":"{}","mol_type":"","notes":"Suite merge map: security(39)=sdlc_secure+org_security+org_compliance | debug(26)=sdlc_debug+org_incident | fix(19)=sdlc_fix | feature(34)=sdlc_feature+org_org | refactor(43)=sdlc_refactor+org_migration | understand(44)=sdlc_understand+sdlc_design+org_domain+org_onboarding | document(11)=sdlc_document | test(12)=sdlc_test | crossrepo(47)=org_crossrepo+org_crossrepo_tracing+org_crossorg+org_platform","original_size":null,"owner":"sjarmak@users.noreply.github.com","payload":"","pinned":0,"priority":1,"quality_score":null,"rig":"","role_bead":"","role_type":"","sender":"","source_repo":"","source_system":"","spec_id":"","status":"closed","target":"","timeout_ns":0,"title":"Unified dual-score benchmark: agent always produces both direct edits and answer.json","updated_at":"2026-03-11T01:49:08Z","waiters":"","wisp_type":"","work_type":""}