
Commit 746da58

LoCoBench Bot authored and claude committed
feat: add slash command definitions for all 18 operational skills
Create .claude/commands/ with skill definitions for the full benchmark lifecycle: pre-run, during-run, post-run, promotion, analysis, QA, and maintenance phases.

- New /promote-run skill documents the staging → official workflow.
- Updated /watch-benchmarks and /run-status to mention the --staging flag.
- Updated /whats-next to check staging before recommending actions.
- Updated CLAUDE.md workflow diagram to include the Promotion phase.
- Also unignore .claude/commands/ in .gitignore so skills are tracked.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent df67d0a commit 746da58

20 files changed: +397 -11 lines changed

.claude/commands/archive-run.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Move old runs to archive/, optional compression, dry-run by default.

## Steps

1. Dry-run (see what would be archived):

```bash
python3 scripts/archive_run.py
```

2. Archive a specific run:

```bash
python3 scripts/archive_run.py --execute <run_dir_name>
```

3. SAFETY: Before archiving, verify all tasks in the batch exist in a newer active batch. The MANIFEST merges across batches — archiving removes any tasks unique to that batch.

## Arguments

$ARGUMENTS — optional: run directory name, --execute, --older-than <days>
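The SAFETY check in step 3 amounts to a set difference. A minimal Python sketch of that check (function and argument names are illustrative, not part of archive_run.py):

```python
def tasks_unique_to_batch(batch_tasks, newer_batch_task_sets):
    # Tasks that would vanish from the merged MANIFEST if this batch were
    # archived: those not covered by any newer active batch.
    covered = set().union(*newer_batch_task_sets) if newer_batch_task_sets else set()
    return set(batch_tasks) - covered
```

An empty result means the batch is safe to archive; a non-empty result lists the tasks that only this batch provides.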
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Audit benchmark against ABC framework (Task/Outcome/Reporting validity).

## Steps

1. Run benchmark audit:

```bash
python3 scripts/abc_audit.py
```

2. For a specific suite:

```bash
python3 scripts/abc_audit.py --suite ccb_navprove
```

3. Report findings by dimension

## Arguments

$ARGUMENTS — optional: --suite <name> to audit a specific benchmark

.claude/commands/check-infra.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
Verify infrastructure readiness before benchmark runs.

Checks: OAuth tokens (all accounts), Docker, disk space, harbor CLI, runs/official/, runs/staging/.

## Steps

1. Run the infrastructure check:

```bash
python3 scripts/check_infra.py
```

2. Review results — FAIL items must be fixed before running benchmarks
3. If staging_dir shows pending promotions, mention that runs are awaiting promotion (use /promote-run)
4. If tokens are expired, suggest running headless login:

```bash
python3 scripts/headless_login.py --all-accounts
```

## Arguments

$ARGUMENTS — none expected
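One of the listed checks, disk space, can be sketched with the standard library. The threshold and function name below are illustrative assumptions; check_infra.py may implement this differently:

```python
import shutil

def disk_space_ok(path=".", min_free_gb=10.0):
    # FAIL if free space on the volume holding `path` is below the threshold.
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_free_gb
```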
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Show divergent tasks across baseline/SG_full configs. Identifies "MCP helps" vs "MCP hurts" patterns.

## Steps

1. Run cross-config comparison:

```bash
python3 scripts/compare_configs.py
```

2. For MCP-conditioned analysis:

```bash
python3 scripts/compare_configs.py --mcp-analysis
```

3. Summarize: which tasks benefit from MCP, which are hurt, overall delta

## Arguments

$ARGUMENTS — optional: --suite <name>, --mcp-analysis
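The "MCP helps" vs "MCP hurts" split reduces to the sign of each task's reward delta between the two configs. A hedged sketch of that classification (not the compare_configs.py implementation):

```python
def classify_mcp_effect(baseline, sg_full, eps=1e-9):
    # baseline / sg_full: {task_name: reward}; only shared tasks are comparable.
    out = {}
    for task in baseline.keys() & sg_full.keys():
        delta = sg_full[task] - baseline[task]
        if delta > eps:
            out[task] = "mcp_helps"
        elif delta < -eps:
            out[task] = "mcp_hurts"
        else:
            out[task] = "neutral"
    return out
```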

.claude/commands/cost-report.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
Token usage and estimated cost by suite/config, most expensive tasks.

## Steps

1. Run cost report:

```bash
python3 scripts/cost_report.py
```

2. Summarize: total cost, per-suite breakdown, most expensive tasks

## Arguments

$ARGUMENTS — optional: --suite <name>, --config <name>
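Cost estimation from token counts is typically tokens times a per-million-token rate. A sketch with placeholder rates (the actual rates used by cost_report.py are not shown in this diff):

```python
def estimate_cost_usd(input_tokens, output_tokens, in_rate=3.0, out_rate=15.0):
    # Rates are $ per million tokens; the defaults here are illustrative
    # placeholders, not the benchmark's real pricing table.
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```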
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
Comprehensive trace evaluation: data integrity, output quality, efficiency analysis.

Includes zero-MCP vs used-MCP classification.

## Steps

1. Run trace audit:

```bash
python3 scripts/audit_traces.py
```

2. Summarize: data integrity issues, MCP adoption rates, efficiency patterns

## Arguments

$ARGUMENTS — optional: --suite <name>, --config <name>
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Generate aggregate CCB evaluation report from completed runs.

## Steps

1. Regenerate MANIFEST first to ensure it's current:

```bash
python3 scripts/generate_manifest.py
```

2. Generate evaluation report:

```bash
python3 scripts/generate_eval_report.py
```

3. Summarize key findings

## Arguments

$ARGUMENTS — none expected
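Step 1 matters because the MANIFEST merges task entries across batches. A minimal sketch of one plausible merge semantics, assuming later batches override earlier entries for the same task (this is an assumption, not taken from generate_manifest.py):

```python
def merge_manifest(batches):
    # batches: list of {task_name: entry} dicts, oldest first.
    # Assumed semantics: a later batch's entry wins for a duplicated task name.
    merged = {}
    for batch in batches:
        merged.update(batch)
    return merged
```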

.claude/commands/mcp-audit.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
MCP usage patterns: used vs zero-MCP, intensity buckets, reward/time deltas conditioned on actual MCP adoption.

## Steps

1. Run MCP audit:

```bash
python3 scripts/mcp_audit.py
```

2. Summarize: MCP adoption rates, tool usage patterns, reward deltas

## Arguments

$ARGUMENTS — optional: --suite <name>
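The "intensity buckets" grouping can be sketched as a simple threshold function. Bucket names and cut-offs below are hypothetical, not mcp_audit.py's actual thresholds:

```python
def mcp_intensity_bucket(mcp_calls):
    # Classify a trace by how many MCP tool calls it made.
    if mcp_calls == 0:
        return "zero-mcp"
    if mcp_calls <= 3:
        return "light"
    return "heavy"
```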

.claude/commands/promote-run.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
Promote validated benchmark runs from staging to official.

## Workflow

1. List staging runs: `python3 scripts/promote_run.py --list`
2. Review the output — look for READY status (0 criticals, all tasks completed)
3. Dry-run a specific run: `python3 scripts/promote_run.py <run_name>`
4. If gates pass, promote: `python3 scripts/promote_run.py --execute <run_name>`
5. To promote all eligible: `python3 scripts/promote_run.py --execute --all`

## Promotion Gates

- 0 critical validation issues (hard gate)
- All tasks have result.json (no running/missing tasks)
- Warnings <= 10 (configurable with --max-warnings)
- Use --force to bypass gates

## After Promotion

- Run is moved from runs/staging/ to runs/official/
- MANIFEST.json is automatically regenerated
- Run `python3 scripts/aggregate_status.py` to verify the promoted run appears

## Arguments

$ARGUMENTS — optional: run directory name(s) to promote, or --all for all eligible

## Steps

1. Run `python3 scripts/promote_run.py --list` to show current staging runs
2. If the user provided a run name or $ARGUMENTS, dry-run validate it: `python3 scripts/promote_run.py <name>`
3. Show the validation results and ask if the user wants to proceed with `--execute`
4. If confirmed, run `python3 scripts/promote_run.py --execute <name>`
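The promotion gates listed above compose into a single boolean check. A hedged Python sketch of that logic (names are illustrative, not promote_run.py internals):

```python
def gates_pass(criticals, tasks_missing_result, warnings,
               max_warnings=10, force=False):
    if force:                      # --force bypasses all gates
        return True
    if criticals > 0:              # hard gate: 0 critical validation issues
        return False
    if tasks_missing_result > 0:   # every task must have result.json
        return False
    return warnings <= max_warnings  # configurable via --max-warnings
```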

.claude/commands/quick-rerun.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
Rerun a single task with haiku model for fast verification.

## Steps

1. Identify the task to rerun (from failure triage or gap analysis)
2. Find the benchmark suite and task directory
3. Run with haiku for quick turnaround:

```bash
harbor run --path benchmarks/<suite>/<task_dir> --model haiku
```

4. Check results in runs/staging/

## Arguments

$ARGUMENTS — task name or path to rerun
