
Add evaluation analysis scripts #605

Open
djukicmilica wants to merge 65 commits into main from private/milicadjukic/BCBenchScript2


@djukicmilica djukicmilica commented Apr 7, 2026

New script: Get-WorkflowSummary.ps1

PowerShell script that fetches evaluation workflow run summaries from GitHub Actions, downloads JSONL artifacts (including from nested zips), and optionally copies them into a stable output folder for analysis.
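
To illustrate the nested-zip handling described above, here is a minimal sketch in Python rather than PowerShell. It is not the code from Get-WorkflowSummary.ps1; the function name `iter_jsonl_members` and the `!/` path separator are assumptions made for this example.

```python
import io
import zipfile


def iter_jsonl_members(zip_bytes, prefix=""):
    """Yield (path, data) for every .jsonl file, descending into nested zips.

    Hypothetical helper for illustration; not part of the PR's scripts.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            data = archive.read(name)
            path = f"{prefix}{name}"
            if name.lower().endswith(".zip"):
                # Recurse into the inner archive, keeping the outer path as a prefix.
                yield from iter_jsonl_members(data, prefix=f"{path}!/")
            elif name.lower().endswith(".jsonl"):
                yield path, data
```

Reading each inner archive into memory keeps the sketch short; for very large artifacts, extracting to a temporary directory would likely be the safer choice.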

New script: bcbench_analyze_artifacts.py

Python script for offline analysis of downloaded BC-Bench artifacts. Supports ZIP and pre-extracted input modes. Produces summary CSVs, top failures, grouped errors, and extracts generated test code/patches per test ID.
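
A rough sketch of the summary step, assuming each JSONL line is a JSON object with `test_id`, `passed`, and `error` fields. Those field names are hypothetical and may not match the actual BC-Bench artifact schema or the logic in bcbench_analyze_artifacts.py.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical column names; the real artifact schema may differ.
FIELDS = ["test_id", "passed", "error"]


def summarize(jsonl_text):
    """Return (rows, top_failures) parsed from JSONL text."""
    rows = []
    failures = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        passed = bool(record.get("passed", False))
        rows.append({
            "test_id": record["test_id"],
            "passed": passed,
            "error": record.get("error", ""),
        })
        if not passed:
            failures[record["test_id"]] += 1
    # most_common() sorts test IDs by failure count, highest first.
    return rows, failures.most_common()


def to_csv(rows):
    """Render summary rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```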

New script: group_errors_from_summary.py

Python script that groups errors from a summary CSV into high-level categories (tests passed pre-patch, failed post-patch, build failures, etc.).
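
One way such grouping could work is with an ordered list of keyword rules, sketched below. The category names follow the description above, but the regular expressions and the function names are assumptions for illustration, not the rules used in group_errors_from_summary.py.

```python
import re
from collections import Counter

# Ordered rules: the first matching pattern wins. Patterns are illustrative guesses.
CATEGORY_RULES = [
    ("build_failure", re.compile(r"build|compil", re.IGNORECASE)),
    ("passed_pre_patch", re.compile(r"passed (before|pre)[- ]patch", re.IGNORECASE)),
    ("failed_post_patch", re.compile(r"failed (after|post)[- ]patch", re.IGNORECASE)),
]


def categorize(message):
    """Map one error message to a high-level category name."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(message):
            return category
    return "other"


def group_errors(messages):
    """Count how many messages fall into each category."""
    return Counter(categorize(m) for m in messages)
```

Keeping the rules in a list makes the precedence explicit: a message mentioning both a build problem and a post-patch failure is counted as a build failure here.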

ventselartur and others added 30 commits December 19, 2025 19:00
@djukicmilica djukicmilica requested a review from haoranpb April 7, 2026 15:44
@djukicmilica djukicmilica marked this pull request as ready for review April 7, 2026 15:45

@ventselartur ventselartur Apr 8, 2026


You cannot check in the change to the config.yaml file. I would suggest separating the script changes from the AlTest.agent.md change. The latter should be run at least 5 times to see our score on BC-Bench, which should be higher than that of the existing version of the AL test agent.


The config.yaml change is OK for now; you'll need it to run things.

But do separate the changes for the scripts.


@ventselartur ventselartur Apr 8, 2026


We need to run this at least 5 times to see if this is going to perform better than the existing AL test agent. Let's not push that change to master yet.
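
A comparison across repeated runs could be sketched as follows. The function name `compare_runs`, the requirement that both agents have at least five runs, and the use of a simple mean comparison are all assumptions for illustration; the scores in the usage note are made up, not BC-Bench results.

```python
from statistics import mean, stdev


def compare_runs(baseline, candidate, min_runs=5):
    """Compare per-run scores of two agents after enough repeated runs.

    Assumes each list holds one benchmark score per run (hypothetical input).
    """
    if len(baseline) < min_runs or len(candidate) < min_runs:
        raise ValueError(f"need at least {min_runs} runs per agent")
    return {
        "baseline_mean": mean(baseline),
        "candidate_mean": mean(candidate),
        # Sample standard deviation gives a rough sense of run-to-run noise.
        "baseline_stdev": stdev(baseline),
        "candidate_stdev": stdev(candidate),
        "candidate_better": mean(candidate) > mean(baseline),
    }
```

For example, `compare_runs([0.40, 0.42, 0.38, 0.41, 0.39], [0.45, 0.44, 0.46, 0.43, 0.47])` reports the candidate as better; if the difference in means is small relative to the standard deviations, more runs would be needed before drawing a conclusion.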


haoranpb commented Apr 9, 2026

Run 1 completed: https://github.com/microsoft/BC-Bench/actions/runs/24083046010

Run 2 in progress: https://github.com/microsoft/BC-Bench/actions/runs/24148965983

@djukicmilica djukicmilica changed the title from "Enable ALTest agent and add evaluation analysis scripts" to "Add evaluation analysis scripts" Apr 9, 2026

@haoranpb haoranpb left a comment


Can you please move those scripts into a tools/altest folder instead (you will have to create those folders)? And leave a README under the tools folder saying those scripts are designed for downloading and analyzing GitHub Actions artifacts.

The scripts folder is designed for PowerShell scripts used in environment setup, etc.

And also mention it in:

- **Dataset**: Benchmark entries following SWE-Bench schema with BC-specific adjustments
- **Python Package** (`src/bcbench/`): CLI tools, agent implementations, and validation utilities
- **PowerShell Scripts** (`scripts/`): Environment setup and dataset verification using AL-GO/BCContainerHelper
- **Agent Evaluations**: Focuses on GitHub Copilot CLI and Claude Code
- **Experiments**: MCP Servers, custom instructions, custom agents, skills, etc. and their performance on the benchmark
- **Notebooks** (`notebooks/`): Analysis and visualization of benchmark results


4 participants