c-ehrlich (Collaborator) commented Oct 1, 2025

In this PR

  • Better end-of-run report (see screenshot)
  • Mid-run reporting simplified (log lines from different evals no longer get mixed)
  • Better baseline handling
  • Fix some runner issues
[Screenshot: end-of-run report]

pkg-pr-new bot commented Oct 1, 2025

npm i https://pkg.pr.new/axiomhq/ai/axiom@93

commit: 2b74789

c-ehrlich marked this pull request as ready for review on October 3, 2025, 10:19
Copilot AI review requested due to automatic review settings on October 3, 2025, 10:19

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enhances evaluation reporting capabilities with improved end-of-run reports, simplified mid-run reporting to prevent log mixing, better baseline handling, and fixes for runner issues.

  • Adds comprehensive final report generation with scorecard-style output and baseline comparisons
  • Implements per-suite baseline handling instead of global baseline storage
  • Introduces flag configuration tracking and comparison with baselines
  • Refactors timeout configuration to be environment-variable driven
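The environment-variable driven timeout in the last bullet could look roughly like the sketch below. The variable name `EVAL_TIMEOUT_MS`, the 60-second default, and the `resolveTimeoutMs` helper are assumptions made for illustration; the summary does not say which variable `run-vitest.ts` reads or how it validates the value.

```typescript
// Hypothetical names: the PR does not state which variable run-vitest.ts reads.
const TIMEOUT_ENV_VAR = 'EVAL_TIMEOUT_MS';
const DEFAULT_TIMEOUT_MS = 60_000;

/**
 * Resolve a timeout in milliseconds from an environment-like object,
 * falling back to a default when the variable is missing or invalid.
 */
function resolveTimeoutMs(
  env: Record<string, string | undefined>,
  fallback: number = DEFAULT_TIMEOUT_MS,
): number {
  const raw = env[TIMEOUT_ENV_VAR];
  if (raw === undefined || raw.trim() === '') return fallback;

  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, and negative values.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}
```

In a Node process this would be called as `resolveTimeoutMs(process.env)`. Taking the environment as a parameter keeps the helper easy to test without mutating global state.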

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/ai/src/util/dot-path.ts | Adds flattenObject utility for converting nested objects to dot notation |
| packages/ai/src/evals/run-vitest.ts | Updates timeout configuration to use environment variable |
| packages/ai/src/evals/reporter.ts | Major refactor to support final report generation and per-suite baseline handling |
| packages/ai/src/evals/reporter.console-utils.ts | Adds comprehensive reporting utilities including final report formatting |
| packages/ai/src/evals/instrument.ts | Fixes instrumentation re-initialization for vitest worker processes |
| packages/ai/src/evals/eval.types.ts | Adds new types for flag diffs and out-of-scope flags |
| packages/ai/src/evals/eval.ts | Enhances flag configuration capture and improves data collection |
| packages/ai/src/evals/eval.service.ts | Improves baseline query and adds flag config mapping |
| examples/example-evals-nextjs/test/feature.eval.ts | Updates example evaluation with new names and unused flag |
| examples/example-evals-nextjs/src/lib/capabilities/classify-ticket/evaluations/ticket-classification.eval.ts | Adds additional test case for spam classification |
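To make the `flattenObject` and flag-diff entries concrete, here is a minimal sketch of flattening a nested flag configuration to dot notation and comparing it against a baseline. It is not the code from `dot-path.ts` or `eval.types.ts`: the treatment of arrays as leaf values, the `FlagDiff` shape, and the `diffFlags` helper are all assumptions for illustration.

```typescript
type FlatObject = Record<string, unknown>;

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

/** Flatten nested objects into dot-notation keys, e.g. { a: { b: 1 } } becomes { 'a.b': 1 }. */
function flattenObject(input: Record<string, unknown>, prefix = ''): FlatObject {
  const out: FlatObject = {};
  for (const [key, value] of Object.entries(input)) {
    const path = prefix ? `${prefix}.${key}` : key;
    // Assumption: arrays and empty objects are kept as leaf values.
    if (isPlainObject(value) && Object.keys(value).length > 0) {
      Object.assign(out, flattenObject(value, path));
    } else {
      out[path] = value;
    }
  }
  return out;
}

// Hypothetical shape; the real flag diff type in eval.types.ts may differ.
interface FlagDiff {
  path: string;
  current: unknown;
  baseline: unknown;
}

/** List every dot path whose value differs between the current run and the baseline. */
function diffFlags(
  current: Record<string, unknown>,
  baseline: Record<string, unknown>,
): FlagDiff[] {
  const flatCurrent = flattenObject(current);
  const flatBaseline = flattenObject(baseline);
  const paths = Array.from(
    new Set([...Object.keys(flatCurrent), ...Object.keys(flatBaseline)]),
  ).sort();

  const diffs: FlagDiff[] = [];
  for (const path of paths) {
    // JSON comparison is a simplification that is adequate for plain config values.
    if (JSON.stringify(flatCurrent[path]) !== JSON.stringify(flatBaseline[path])) {
      diffs.push({ path, current: flatCurrent[path], baseline: flatBaseline[path] });
    }
  }
  return diffs;
}
```

Comparing flattened keys means a report can name the exact flag that changed (for example `model.temperature`) instead of flagging the whole nested object.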

printTestCaseSuccessOrFailed,
type SuiteData,
} from './reporter.console-utils';
import { flattenObject } from 'src/util/dot-path';

Copilot AI Oct 3, 2025

Import path should use relative path instead of 'src/' alias for consistency with other imports in the file.

Suggested change
- import { flattenObject } from 'src/util/dot-path';
+ import { flattenObject } from '../util/dot-path';

for (const flag of suite.outOfScopeFlags) {
  const lastStackTraceFrame = flag.stackTrace[0];
  const lastStackTraceFnName = lastStackTraceFrame.split(' ').shift();
  const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.slice(0, -1);

Copilot AI Oct 3, 2025

The magic number -1 in slice() is unclear, and it strips the last character even when that character is not a closing parenthesis. Consider adding a comment explaining that it removes the closing parenthesis, or use a more descriptive approach.

Suggested change
- const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.slice(0, -1);
+ // Remove trailing closing parenthesis from file name, if present
+ const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.replace(/\)$/, '');
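The practical difference between the two versions shows up on frames that do not end in a parenthesis. The sketch below compares them on two made-up frame strings; the exact format of `flag.stackTrace` entries in this PR is an assumption, and the helper names `fileBySlice` and `fileByReplace` are hypothetical.

```typescript
// Original approach: always drops the last character.
function fileBySlice(frame: string): string | undefined {
  return frame.split('/').pop()?.slice(0, -1);
}

// Suggested approach: drops a trailing ')' only when one is present.
function fileByReplace(frame: string): string | undefined {
  return frame.split('/').pop()?.replace(/\)$/, '');
}

// Made-up frames for illustration; real entries may be formatted differently.
const frameWithParen = 'myEval (/repo/test/feature.eval.ts:12:5)';
const frameWithoutParen = '/repo/test/feature.eval.ts:12:5';

const sliced = [fileBySlice(frameWithParen), fileBySlice(frameWithoutParen)];
// sliced is ['feature.eval.ts:12:5', 'feature.eval.ts:12:']: the second entry lost a digit.

const replaced = [fileByReplace(frameWithParen), fileByReplace(frameWithoutParen)];
// replaced is ['feature.eval.ts:12:5', 'feature.eval.ts:12:5']: both are intact.
```

Both versions agree when the frame is wrapped in parentheses, so the regex form changes behavior only in the case where `slice(0, -1)` would have corrupted the column number.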

[Attr.GenAI.Operation.Name]: 'eval.case',
[Attr.Eval.ID]: evalId,
[Attr.Eval.Name]: evalName,
[Attr.Eval.Version]: evalName,
Collaborator

Suggested change
- [Attr.Eval.Version]: evalName,
+ [Attr.Eval.Version]: evalVersion,

This should be the eval version, right?

[Attr.GenAI.Operation.Name]: 'eval.score',
[Attr.Eval.ID]: evalId,
[Attr.Eval.Name]: evalName,
[Attr.Eval.Version]: evalName,
Collaborator

Suggested change
- [Attr.Eval.Version]: evalName,
+ [Attr.Eval.Version]: evalVersion,
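Since the same slip appears in both the `eval.case` and `eval.score` attribute blocks, one way to keep them from drifting apart is to build the shared identity attributes in a single place. This is only a sketch of that idea, not the PR's code: the `eval.*` key strings stand in for the `Attr.Eval.*` constants, and `EvalIdentity` and `evalSpanAttributes` are hypothetical names.

```typescript
// Hypothetical key strings standing in for the Attr.* constants used in the PR.
const ATTR_OPERATION_NAME = 'gen_ai.operation.name';
const ATTR_EVAL_ID = 'eval.id';
const ATTR_EVAL_NAME = 'eval.name';
const ATTR_EVAL_VERSION = 'eval.version';

interface EvalIdentity {
  evalId: string;
  evalName: string;
  evalVersion: string;
}

type EvalOperation = 'eval.case' | 'eval.score';

/** Build the attributes shared by case and score spans from one identity object. */
function evalSpanAttributes(
  operation: EvalOperation,
  identity: EvalIdentity,
): Record<string, string> {
  return {
    [ATTR_OPERATION_NAME]: operation,
    [ATTR_EVAL_ID]: identity.evalId,
    [ATTR_EVAL_NAME]: identity.evalName,
    // The version comes from evalVersion, not evalName.
    [ATTR_EVAL_VERSION]: identity.evalVersion,
  };
}

const identity: EvalIdentity = { evalId: 'abc123', evalName: 'feature-eval', evalVersion: 'v2' };
const caseAttributes = evalSpanAttributes('eval.case', identity);
const scoreAttributes = evalSpanAttributes('eval.score', identity);
```

With a shared helper, a mix-up such as assigning the name to the version field would need fixing in one place only.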
