Commit ff96d5d

🐛 fix: simplify results aggregation with cleaner implementation (#179)
1 parent f350cd4 commit ff96d5d

4 files changed: +501 -1947 lines changed

README.md

Lines changed: 6 additions & 1 deletion
@@ -155,9 +155,14 @@ You can also follow `docs/quickstart.md` for the shortest end-to-end path.
 - Results are organized under `./results/{exp_name}/{model}__{mcp}/run-*/` (JSON + CSV per task).
 - Generate a summary with:
 ```bash
+# Basic usage
 python -m src.aggregators.aggregate_results --exp-name exp
+
+# For k-run experiments with single-run models
+python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
 ```
-- Includes multi-run metrics (e.g., pass@k) for stability comparisons.
+- Only models with complete results across all tasks and runs are included in the final summary.
+- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
 
 ---
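
Reviewer note: the README text added in this hunk mentions pass@k, pass^k, and a completeness filter without defining them. The sketch below shows one common reading, assuming pass@k means a task passes in at least one of k runs and pass^k means it passes in all k runs. These definitions, the function names, and the `runs` data layout are assumptions made for illustration and are not taken from `src.aggregators.aggregate_results`.

```python
from typing import Dict, List


def pass_at_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks that pass in at least one of the first k runs (assumed definition)."""
    if not runs:
        raise ValueError("no tasks to aggregate")
    return sum(any(results[:k]) for results in runs.values()) / len(runs)


def pass_pow_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks that pass in every one of the first k runs (assumed definition)."""
    if not runs:
        raise ValueError("no tasks to aggregate")
    return sum(all(results[:k]) for results in runs.values()) / len(runs)


def is_complete(runs: Dict[str, List[bool]], tasks: List[str], k: int) -> bool:
    """True only if every expected task has at least k recorded runs."""
    return all(len(runs.get(task, [])) >= k for task in tasks)


# Hypothetical per-task outcomes for one model over k = 4 runs.
runs = {
    "task_a": [True, True, True, True],
    "task_b": [False, True, False, False],
    "task_c": [False, False, False, False],
}

if is_complete(runs, ["task_a", "task_b", "task_c"], k=4):
    print(f"pass@4 = {pass_at_k(runs, 4):.3f}")
    print(f"pass^4 = {pass_pow_k(runs, 4):.3f}")
```

Under these assumed definitions, pass@k can only rise or stay flat as k grows, while pass^k can only fall or stay flat, which is why reporting the pair is useful for stability comparisons when k > 1.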
