@c-ehrlich
Collaborator

Changes

Evaluation reports now show config flag differences against both the baseline run and the default values. Previously, config changes were only shown when comparing against a baseline, so first runs and runs without a baseline showed no config context.

  • Add defaultFlagConfig to suite data, populated from configEnd.flags
  • calculateFlagDiff now diffs against both the baseline and the defaults
  • The config changes section displays default and baseline values on separate lines
  • Scores now default to {} instead of undefined to avoid null access errors (the value is cast to Case, which expects scores to exist)
  • Add tests
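
To make the two-way comparison concrete, here is a minimal sketch of diffing the current flags against both a baseline and the defaults. It is not the code in this PR: the type names, the `baselineFlagConfig` field, the `differsFrom…` booleans, and both function names are assumptions made for illustration. Only `defaultFlagConfig` and the `<not set>` placeholder come from the change itself.

```typescript
// Illustrative sketch only: these types and names are assumptions,
// not the actual definitions in packages/ai/src/evals.
type FlagValue = string | number | boolean | undefined;
type FlagConfig = Record<string, FlagValue>;

interface FlagDiffSketch {
  flag: string;
  current: FlagValue;
  baseline: FlagValue;
  default: FlagValue;
  differsFromBaseline: boolean;
  differsFromDefault: boolean;
}

interface SuiteSketch {
  flagConfig: FlagConfig;
  defaultFlagConfig?: FlagConfig;
  baselineFlagConfig?: FlagConfig; // hypothetical field name
}

function calculateFlagDiffSketch(suite: SuiteSketch): FlagDiffSketch[] {
  const current = suite.flagConfig;
  const baseline = suite.baselineFlagConfig;
  const defaults = suite.defaultFlagConfig;

  // Union of flag names across all three configs, de-duplicated and sorted.
  const flags = Object.keys(current)
    .concat(Object.keys(baseline ?? {}), Object.keys(defaults ?? {}))
    .filter((name, i, all) => all.indexOf(name) === i)
    .sort();

  const diffs: FlagDiffSketch[] = [];
  for (const flag of flags) {
    // A comparison only counts when that reference config exists at all.
    const differsFromBaseline =
      baseline !== undefined && baseline[flag] !== current[flag];
    const differsFromDefault =
      defaults !== undefined && defaults[flag] !== current[flag];
    if (differsFromBaseline || differsFromDefault) {
      diffs.push({
        flag,
        current: current[flag],
        baseline: baseline?.[flag],
        default: defaults?.[flag],
        differsFromBaseline,
        differsFromDefault,
      });
    }
  }
  return diffs;
}

// Renders one diff with default and baseline values on separate lines.
function formatFlagDiffSketch(diff: FlagDiffSketch): string[] {
  const show = (v: FlagValue): string => String(v ?? '<not set>');
  const lines = [`${diff.flag}: ${show(diff.current)}`];
  if (diff.differsFromDefault) lines.push(`  default: ${show(diff.default)}`);
  if (diff.differsFromBaseline) lines.push(`  baseline: ${show(diff.baseline)}`);
  return lines;
}
```

In this sketch a first run (no baseline) still produces entries whenever a flag departs from its default, which is the case the description says previously showed no config context.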

Demo

Before:
[screenshot: image]

After:
[screenshot: CleanShot 2025-11-28 at 15 41 57@2x]

Copilot AI review requested due to automatic review settings November 28, 2025 08:47
@pkg-pr-new

pkg-pr-new bot commented Nov 28, 2025

npm i https://pkg.pr.new/axiomhq/ai/axiom@174

commit: 3a8f22c

Copilot finished reviewing on behalf of c-ehrlich November 28, 2025 08:49
Contributor

Copilot AI left a comment

Pull request overview

This PR enhances evaluation reports to display configuration flag differences against both baseline runs and default values. Previously, config changes were only shown when a baseline existed, providing no configuration context for initial runs or runs without baselines.

Key changes:

  • Added defaultFlagConfig field to suite data, populated from configEnd.flags
  • Enhanced calculateFlagDiff to compare against both baseline and default configurations
  • Updated display logic to show default and baseline values on separate lines
  • Changed scores default from undefined to {} to prevent null access errors when accessing baseline scores
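
The last point can be illustrated with a small sketch of why an empty-object default is safer than undefined. The `CaseSketch` shape, the `{ value: number }` score shape, and both helper names are hypothetical; the actual Case type in eval.service.ts may look different.

```typescript
// Illustrative sketch only: hypothetical shapes, not the real Case type.
interface ScoreSketch {
  value: number;
}

interface CaseSketch {
  index: number;
  scores: Record<string, ScoreSketch>; // consumers assume this always exists
}

interface RawCaseSketch {
  index: number;
  scores?: Record<string, ScoreSketch>;
}

// Normalizes a raw record so that `scores` is always an object.
function toCaseSketch(raw: RawCaseSketch): CaseSketch {
  return { index: raw.index, scores: raw.scores ?? {} };
}

// Safe lookup: with `scores` defaulted to {}, a missing scorer yields
// undefined instead of throwing on property access of undefined.
function baselineScoreSketch(c: CaseSketch, scorer: string): number | undefined {
  return c.scores[scorer]?.value;
}
```

Had `scores` been left undefined and the record merely cast to the case type, the lookup `c.scores[scorer]` would throw a TypeError at runtime even though the type checker accepts it.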

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/ai/src/evals/eval.types.ts | Added default field to FlagDiff type |
| packages/ai/src/evals/reporter.ts | Captures defaultFlagConfig from configEnd.flags and adds it to suite data |
| packages/ai/src/evals/reporter.console-utils.ts | Enhanced calculateFlagDiff to compare against both baseline and defaults; updated printing logic to display both comparisons |
| packages/ai/src/evals/eval.service.ts | Changed scores default from undefined to {} to prevent crashes when accessing baseline scores |
| packages/ai/test/evals/reporter.console-utils.test.ts | Added comprehensive tests for new flag diff scenarios and display logic |

logger(
`│ • ${flag}: ${current ?? '<not set>'} ${c.gray(`(baseline: ${baseline ?? '<not set>'})`)}`,
);
const hasConfigChanges = flagDiff.length > 0;
Collaborator

If I understand correctly, flagDiff will only show up if there's a baseline, right?

const flagDiff = suite.baseline ? calculateFlagDiff(suite) : [];

So we are not showing flagDiff for debug evals. Is that intended?
