feat: better eval reporting #93

c-ehrlich · 2025-10-01T03:41:10Z

In this PR:

* Better end-of-run report (see screenshot)
* Mid-run reporting simplified (log lines from different evals no longer get mixed)
* Better baseline handling
* Fix some runner issues

pkg-pr-new · 2025-10-01T03:42:47Z

npm i https://pkg.pr.new/axiomhq/ai/axiom@93

commit: 2b74789

Copilot

Pull Request Overview

This PR enhances evaluation reporting capabilities with improved end-of-run reports, simplified mid-run reporting to prevent log mixing, better baseline handling, and fixes for runner issues.

* Adds comprehensive final report generation with scorecard-style output and baseline comparisons
* Implements per-suite baseline handling instead of global baseline storage
* Introduces flag configuration tracking and comparison with baselines
* Refactors timeout configuration to be environment-variable driven
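
To make the flag comparison in the overview concrete, here is a minimal sketch of diffing a run's flag configuration against a baseline. It assumes both configs are already flattened to dot-notation keys; `FlatConfig`, `FlagDiff`, and `diffFlags` are hypothetical names for illustration, and the PR's actual types in eval.types.ts may look different.

```typescript
// Hypothetical shapes for illustration only; not taken from the PR.
type FlatConfig = Record<string, unknown>;

interface FlagDiff {
  flag: string;
  current: unknown;
  baseline: unknown;
}

// Compare two flattened flag configs and report every key whose value differs.
// A key that exists on only one side is reported with `undefined` on the other.
function diffFlags(current: FlatConfig, baseline: FlatConfig): FlagDiff[] {
  const keys = new Set([...Object.keys(current), ...Object.keys(baseline)]);
  const diffs: FlagDiff[] = [];
  for (const flag of Array.from(keys).sort()) {
    // JSON.stringify is a cheap structural comparison for primitives and arrays.
    if (JSON.stringify(current[flag]) !== JSON.stringify(baseline[flag])) {
      diffs.push({ flag, current: current[flag], baseline: baseline[flag] });
    }
  }
  return diffs;
}
```

Sorting the keys keeps the report order stable between runs, which makes two final reports easier to compare by eye.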

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
packages/ai/src/util/dot-path.ts	Adds flattenObject utility for converting nested objects to dot notation
packages/ai/src/evals/run-vitest.ts	Updates timeout configuration to use environment variable
packages/ai/src/evals/reporter.ts	Major refactor to support final report generation and per-suite baseline handling
packages/ai/src/evals/reporter.console-utils.ts	Adds comprehensive reporting utilities including final report formatting
packages/ai/src/evals/instrument.ts	Fixes instrumentation re-initialization for vitest worker processes
packages/ai/src/evals/eval.types.ts	Adds new types for flag diffs and out-of-scope flags
packages/ai/src/evals/eval.ts	Enhances flag configuration capture and improves data collection
packages/ai/src/evals/eval.service.ts	Improves baseline query and adds flag config mapping
examples/example-evals-nextjs/test/feature.eval.ts	Updates example evaluation with new names and unused flag
examples/example-evals-nextjs/src/lib/capabilities/classify-ticket/evaluations/ticket-classification.eval.ts	Adds additional test case for spam classification
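
For readers unfamiliar with the dot-notation flattening mentioned for dot-path.ts, the following is a minimal sketch of what such a utility could look like. It is an assumption about the behavior, not the code added in this PR; the real `flattenObject` may treat arrays, empty objects, or class instances differently.

```typescript
// Hypothetical sketch; the actual flattenObject in dot-path.ts may differ.
function flattenObject(
  obj: Record<string, unknown>,
  prefix = '',
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const isNestedObject =
      typeof value === 'object' && value !== null && !Array.isArray(value);
    if (isNestedObject) {
      // Recurse into nested objects; an empty object contributes no keys.
      Object.assign(result, flattenObject(value as Record<string, unknown>, path));
    } else {
      // Primitives, null, and arrays are kept as leaf values.
      result[path] = value;
    }
  }
  return result;
}

const flat = flattenObject({
  model: { name: 'x', params: { temperature: 0.2 } },
  tags: ['a'],
});
// flat has the keys 'model.name', 'model.params.temperature', and 'tags'.
```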

Copilot · 2025-10-03T10:19:59Z

packages/ai/src/evals/reporter.ts

  printTestCaseSuccessOrFailed,
+  type SuiteData,
 } from './reporter.console-utils';
+import { flattenObject } from 'src/util/dot-path';


Import path should use relative path instead of 'src/' alias for consistency with other imports in the file.

Suggested change

- import { flattenObject } from 'src/util/dot-path';
+ import { flattenObject } from '../util/dot-path';

Copilot · 2025-10-03T10:19:59Z

packages/ai/src/evals/reporter.console-utils.ts

+    for (const flag of suite.outOfScopeFlags) {
+      const lastStackTraceFrame = flag.stackTrace[0];
+      const lastStackTraceFnName = lastStackTraceFrame.split(' ').shift();
+      const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.slice(0, -1);


Magic number -1 in slice() is unclear. Consider adding a comment explaining that it removes the closing parenthesis, or use a more descriptive approach.

Suggested change

- const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.slice(0, -1);
+ // Remove trailing closing parenthesis from file name, if present
+ const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.replace(/\)$/, '');
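
The practical difference between the two variants shows up when a frame has no trailing parenthesis. The sketch below uses made-up frame strings, assuming V8-style frames with the leading "at " already stripped; the real contents of `flag.stackTrace` are not shown in this thread.

```typescript
// Hypothetical frames for illustration; not taken from the PR.
const withParen = 'myEval (/repo/test/feature.eval.ts:12:5)';
const withoutParen = '/repo/test/feature.eval.ts:12:5';

// Original approach: unconditionally drop the last character.
const bySlice = (frame: string): string | undefined =>
  frame.split('/').pop()?.slice(0, -1);

// Suggested approach: drop a trailing ")" only when one is present.
const byReplace = (frame: string): string | undefined =>
  frame.split('/').pop()?.replace(/\)$/, '');

console.log(bySlice(withParen)); // feature.eval.ts:12:5
console.log(byReplace(withParen)); // feature.eval.ts:12:5
console.log(bySlice(withoutParen)); // feature.eval.ts:12: (last digit lost)
console.log(byReplace(withoutParen)); // feature.eval.ts:12:5
```

Both variants agree on frames that end in ")", so the regex version only changes behavior for anonymous or top-level frames that lack the wrapping parentheses.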

thesollyz · 2025-10-03T11:59:44Z

packages/ai/src/evals/eval.ts

+              [Attr.GenAI.Operation.Name]: 'eval.case',
+              [Attr.Eval.ID]: evalId,
+              [Attr.Eval.Name]: evalName,
+              [Attr.Eval.Version]: evalName,


Suggested change

- [Attr.Eval.Version]: evalName,
+ [Attr.Eval.Version]: evalVersion,

This should be the eval version, right?

thesollyz · 2025-10-03T11:59:58Z

packages/ai/src/evals/eval.ts

+                    [Attr.GenAI.Operation.Name]: 'eval.score',
+                    [Attr.Eval.ID]: evalId,
+                    [Attr.Eval.Name]: evalName,
+                    [Attr.Eval.Version]: evalName,


Suggested change

- [Attr.Eval.Version]: evalName,
+ [Attr.Eval.Version]: evalVersion,

c-ehrlich added 6 commits September 30, 2025 15:04

fix typo

bbb2863

print final report

ca6e3b0

define this type

6d1af25

increase default timeout

03e08e7

basic end-of-suite report

3c63bca

REVERTME

84dc7cb

c-ehrlich added 20 commits October 3, 2025 15:40

feat: update otel setup (#94)

b38f6e8

* remove NodeSDK from instrumentation setups
* uninstall NodeSDK
* update docs
* better tools

add a bg here

0a6db01

final report summary

2b862eb

use % scores

d1244c9

better report

e838b78

a better final reporter

d048967

better baseline detection

6fe842d

record traces from worker process

1d63c86

more runner changes

128e19a

Merge branch 'main' into reporter-1

58834a8

misc cleanup

f9c1018

undo this diff

b29a389

misc cleanup

9a2bcac

cleanup

480a9a7

more cleanup

6cf8a1c

better mid-run

921d59b

right color for this closing bracket

e5bc520

print baseline here again

73e362e

also this

b4aa5db

remove dumb comments

2b74789

c-ehrlich marked this pull request as ready for review October 3, 2025 10:19

Copilot AI review requested due to automatic review settings October 3, 2025 10:19

Copilot AI reviewed Oct 3, 2025

thesollyz requested a review from gabrielelpidio October 3, 2025 11:28

thesollyz reviewed Oct 3, 2025
