
Commit 746da58

LoCoBench Bot authored and claude committed
feat: add slash command definitions for all 18 operational skills
Create .claude/commands/ with skill definitions for the full benchmark lifecycle: pre-run, during-run, post-run, promotion, analysis, QA, and maintenance phases.

- New /promote-run skill documents the staging → official workflow.
- Updated /watch-benchmarks and /run-status to mention the --staging flag.
- Updated /whats-next to check staging before recommending actions.
- Updated CLAUDE.md workflow diagram to include the Promotion phase.
- Also unignore .claude/commands/ in .gitignore so skills are tracked.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent df67d0a commit 746da58

20 files changed: +397 -11 lines changed

.claude/commands/archive-run.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Move old runs to archive/, optional compression, dry-run by default.

## Steps

1. Dry-run (see what would be archived):

```bash
python3 scripts/archive_run.py
```

2. Archive a specific run:

```bash
python3 scripts/archive_run.py --execute <run_dir_name>
```

3. SAFETY: Before archiving, verify all tasks in the batch exist in a newer active batch. The MANIFEST merges across batches — archiving removes any tasks unique to that batch.

## Arguments

$ARGUMENTS — optional: run directory name, --execute, --older-than <days>
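The SAFETY check in step 3 amounts to a set difference. A minimal Python sketch of that check (function and argument names are illustrative, not part of archive_run.py):

```python
def tasks_unique_to_batch(batch_tasks, newer_batch_task_sets):
    # Tasks that would vanish from the merged MANIFEST if this batch were
    # archived: those not covered by any newer active batch.
    covered = set().union(*newer_batch_task_sets) if newer_batch_task_sets else set()
    return set(batch_tasks) - covered
```

An empty result means the batch is safe to archive; a non-empty result lists the tasks that only this batch provides.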
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Audit benchmark against ABC framework (Task/Outcome/Reporting validity).

## Steps

1. Run benchmark audit:

```bash
python3 scripts/abc_audit.py
```

2. For a specific suite:

```bash
python3 scripts/abc_audit.py --suite ccb_navprove
```

3. Report findings by dimension

## Arguments

$ARGUMENTS — optional: --suite <name> to audit a specific benchmark

.claude/commands/check-infra.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
Verify infrastructure readiness before benchmark runs.

Checks: OAuth tokens (all accounts), Docker, disk space, harbor CLI, runs/official/, runs/staging/.

## Steps

1. Run the infrastructure check:

```bash
python3 scripts/check_infra.py
```

2. Review results — FAIL items must be fixed before running benchmarks
3. If staging_dir shows pending promotions, mention that runs are awaiting promotion (use /promote-run)
4. If tokens are expired, suggest running headless login:

```bash
python3 scripts/headless_login.py --all-accounts
```

## Arguments

$ARGUMENTS — none expected
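One of the listed checks, disk space, can be sketched with the standard library. The threshold and function name below are illustrative assumptions; check_infra.py may implement this differently:

```python
import shutil

def disk_space_ok(path=".", min_free_gb=10.0):
    # FAIL if free space on the volume holding `path` is below the threshold.
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_free_gb
```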
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Show divergent tasks across baseline/SG_full configs. Identifies "MCP helps" vs "MCP hurts" patterns.

## Steps

1. Run cross-config comparison:

```bash
python3 scripts/compare_configs.py
```

2. For MCP-conditioned analysis:

```bash
python3 scripts/compare_configs.py --mcp-analysis
```

3. Summarize: which tasks benefit from MCP, which are hurt, overall delta

## Arguments

$ARGUMENTS — optional: --suite <name>, --mcp-analysis
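The "MCP helps" vs "MCP hurts" split reduces to the sign of each task's reward delta between the two configs. A hedged sketch of that classification (not the compare_configs.py implementation):

```python
def classify_mcp_effect(baseline, sg_full, eps=1e-9):
    # baseline / sg_full: {task_name: reward}; only shared tasks are comparable.
    out = {}
    for task in baseline.keys() & sg_full.keys():
        delta = sg_full[task] - baseline[task]
        if delta > eps:
            out[task] = "mcp_helps"
        elif delta < -eps:
            out[task] = "mcp_hurts"
        else:
            out[task] = "neutral"
    return out
```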

.claude/commands/cost-report.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
Token usage and estimated cost by suite/config, most expensive tasks.

## Steps

1. Run cost report:

```bash
python3 scripts/cost_report.py
```

2. Summarize: total cost, per-suite breakdown, most expensive tasks

## Arguments

$ARGUMENTS — optional: --suite <name>, --config <name>
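Cost estimation from token counts is typically tokens times a per-million-token rate. A sketch with placeholder rates (the actual rates used by cost_report.py are not shown in this diff):

```python
def estimate_cost_usd(input_tokens, output_tokens, in_rate=3.0, out_rate=15.0):
    # Rates are $ per million tokens; the defaults here are illustrative
    # placeholders, not the benchmark's real pricing table.
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```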
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
Comprehensive trace evaluation: data integrity, output quality, efficiency analysis.

Includes zero-MCP vs used-MCP classification.

## Steps

1. Run trace audit:

```bash
python3 scripts/audit_traces.py
```

2. Summarize: data integrity issues, MCP adoption rates, efficiency patterns

## Arguments

$ARGUMENTS — optional: --suite <name>, --config <name>
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
Generate aggregate CCB evaluation report from completed runs.

## Steps

1. Regenerate MANIFEST first to ensure it's current:

```bash
python3 scripts/generate_manifest.py
```

2. Generate evaluation report:

```bash
python3 scripts/generate_eval_report.py
```

3. Summarize key findings

## Arguments

$ARGUMENTS — none expected
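Step 1 matters because the MANIFEST merges task entries across batches. A minimal sketch of one plausible merge semantics, assuming later batches override earlier entries for the same task (this is an assumption, not taken from generate_manifest.py):

```python
def merge_manifest(batches):
    # batches: list of {task_name: entry} dicts, oldest first.
    # Assumed semantics: a later batch's entry wins for a duplicated task name.
    merged = {}
    for batch in batches:
        merged.update(batch)
    return merged
```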

.claude/commands/mcp-audit.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
MCP usage patterns: used vs zero-MCP, intensity buckets, reward/time deltas conditioned on actual MCP adoption.

## Steps

1. Run MCP audit:

```bash
python3 scripts/mcp_audit.py
```

2. Summarize: MCP adoption rates, tool usage patterns, reward deltas

## Arguments

$ARGUMENTS — optional: --suite <name>
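The "intensity buckets" grouping can be sketched as a simple threshold function. Bucket names and cut-offs below are hypothetical, not mcp_audit.py's actual thresholds:

```python
def mcp_intensity_bucket(mcp_calls):
    # Classify a trace by how many MCP tool calls it made.
    if mcp_calls == 0:
        return "zero-mcp"
    if mcp_calls <= 3:
        return "light"
    return "heavy"
```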

.claude/commands/promote-run.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
Promote validated benchmark runs from staging to official.

## Workflow

1. List staging runs: `python3 scripts/promote_run.py --list`
2. Review the output — look for READY status (0 criticals, all tasks completed)
3. Dry-run a specific run: `python3 scripts/promote_run.py <run_name>`
4. If gates pass, promote: `python3 scripts/promote_run.py --execute <run_name>`
5. To promote all eligible: `python3 scripts/promote_run.py --execute --all`

## Promotion Gates

- 0 critical validation issues (hard gate)
- All tasks have result.json (no running/missing tasks)
- Warnings <= 10 (configurable with --max-warnings)
- Use --force to bypass gates

## After Promotion

- Run is moved from runs/staging/ to runs/official/
- MANIFEST.json is automatically regenerated
- Run `python3 scripts/aggregate_status.py` to verify the promoted run appears

## Arguments

$ARGUMENTS — optional: run directory name(s) to promote, or --all for all eligible

## Steps

1. Run `python3 scripts/promote_run.py --list` to show current staging runs
2. If the user provided a run name or $ARGUMENTS, dry-run validate it: `python3 scripts/promote_run.py <name>`
3. Show the validation results and ask if the user wants to proceed with `--execute`
4. If confirmed, run `python3 scripts/promote_run.py --execute <name>`
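The promotion gates listed above compose into a single boolean check. A hedged Python sketch of that logic (names are illustrative, not promote_run.py internals):

```python
def gates_pass(criticals, tasks_missing_result, warnings,
               max_warnings=10, force=False):
    if force:                      # --force bypasses all gates
        return True
    if criticals > 0:              # hard gate: 0 critical validation issues
        return False
    if tasks_missing_result > 0:   # every task must have result.json
        return False
    return warnings <= max_warnings  # configurable via --max-warnings
```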

.claude/commands/quick-rerun.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
Rerun a single task with haiku model for fast verification.

## Steps

1. Identify the task to rerun (from failure triage or gap analysis)
2. Find the benchmark suite and task directory
3. Run with haiku for quick turnaround:

```bash
harbor run --path benchmarks/<suite>/<task_dir> --model haiku
```

4. Check results in runs/staging/

## Arguments

$ARGUMENTS — task name or path to rerun
