Conversation
… test agent but pass with general agent
…rs and table relation
…djukic/BCBenchScript2
You cannot check in the change to the config.yaml file. I would suggest separating the script changes from the AlTest.agent.md change. The latter should be run at least 5 times to see our score on BC Bench, and it should score higher than the existing version of the AL test agent.
The config.yaml change is OK for now; you'll need it to run things.
But do separate out the changes to the scripts.
We need to run this at least 5 times to see if this is going to perform better than the existing AL test agent. Let's not push that change to master yet.
Run 1 completed: https://github.com/microsoft/BC-Bench/actions/runs/24083046010
Run 2 in progress: https://github.com/microsoft/BC-Bench/actions/runs/24148965983
Can you please move those scripts into a tools/altest folder instead (you will have to create those folders)? Also leave a README under the tools folder saying those scripts are designed for downloading and analyzing GitHub Actions artifacts.
The scripts folder is meant for PowerShell scripts used in environment setup, etc.
And also mention it in BC-Bench/.github/copilot-instructions.md (lines 5 to 10 in 143a74e).
New script: Get-WorkflowSummary.ps1
PowerShell script that fetches evaluation workflow run summaries from GitHub Actions, downloads JSONL artifacts (including from nested zips), and optionally copies them into a stable output folder for analysis.
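The nested-zip handling described above can be sketched as follows. This is only an illustration in Python, not the PowerShell script itself; the function name `extract_jsonl` and the flattened file-naming scheme are assumptions made for this example.

```python
import io
import zipfile
from pathlib import Path


def extract_jsonl(zip_bytes: bytes, out_dir: Path, prefix: str = "") -> list[Path]:
    """Write every .jsonl member to out_dir, recursing into nested .zip members.

    Hypothetical helper: the naming scheme (path separators replaced by "_",
    nested archive name used as a prefix) is an assumption for this sketch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            data = archive.read(name)
            flat = prefix + name.replace("/", "_")
            if name.lower().endswith(".zip"):
                # Nested archive: recurse, using its name (minus ".zip") as prefix.
                written += extract_jsonl(data, out_dir, prefix=flat[:-4] + "_")
            elif name.lower().endswith(".jsonl"):
                target = out_dir / flat
                target.write_bytes(data)
                written.append(target)
    return written
```

Reading nested archives from memory avoids leaving intermediate zip files on disk, which keeps the output folder limited to the JSONL files that the analysis step needs.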
New script: bcbench_analyze_artifacts.py
Python script for offline analysis of downloaded BC-Bench artifacts. Supports ZIP and pre-extracted input modes. Produces summary CSVs, top failures, grouped errors, and extracts generated test code/patches per test ID.
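A rough sketch of how the two input modes could feed one summary CSV is shown below. It is not taken from bcbench_analyze_artifacts.py: the function names and the record fields `test_id` and `status` are hypothetical, and the real artifact schema may differ.

```python
import csv
import json
import zipfile
from pathlib import Path


def iter_records(source: Path):
    """Yield one dict per JSONL line from a .zip file or a pre-extracted folder."""
    if source.is_dir():
        # Pre-extracted mode: walk the folder for .jsonl files.
        texts = [p.read_text(encoding="utf-8") for p in sorted(source.rglob("*.jsonl"))]
    else:
        # ZIP mode: read .jsonl members directly from the archive.
        with zipfile.ZipFile(source) as archive:
            members = sorted(n for n in archive.namelist() if n.endswith(".jsonl"))
            texts = [archive.read(n).decode("utf-8") for n in members]
    for text in texts:
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)


def write_summary(records, csv_path: Path, id_field="test_id", status_field="status") -> int:
    """Write a two-column summary CSV; the field names are assumptions."""
    rows = [
        {"test_id": r.get(id_field, ""), "status": r.get(status_field, "unknown")}
        for r in records
    ]
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["test_id", "status"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Funnelling both modes through a single record iterator means the summary logic does not need to know where the data came from.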
New script: group_errors_from_summary.py
Python script that groups errors from a summary CSV into high-level categories (tests passed pre-patch, failed post-patch, build failures, etc.).
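One simple way to do this kind of grouping is keyword matching against the error text, sketched below. The category labels follow the description above, but the keyword lists, the `error` column name, and the function names are assumptions for illustration, not the contents of group_errors_from_summary.py.

```python
import csv
import io
from collections import Counter

# Hypothetical rules: each category maps to substrings that identify it.
RULES = [
    ("build failure", ("build failed", "compilation error")),
    ("tests passed pre-patch", ("passed before patch", "passed pre-patch")),
    ("tests failed post-patch", ("failed after patch", "failed post-patch")),
]


def categorize(message: str) -> str:
    """Return the first category whose keywords appear in the message."""
    lowered = message.lower()
    for category, needles in RULES:
        if any(needle in lowered for needle in needles):
            return category
    return "other"


def group_errors(csv_text: str, column: str = "error") -> Counter:
    """Count rows per category; `column` is an assumed CSV header name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(categorize(row.get(column) or "") for row in reader)
```

Because the rules are checked in order, a message that matches several categories is counted under the first one, so the most specific rules should be listed first.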