feat: trials #140

c-ehrlich · 2025-11-19T07:52:17Z

No description provided.

pkg-pr-new · 2025-11-19T07:53:30Z

npm i https://pkg.pr.new/axiomhq/ai/axiom@140

commit: 72a5d68

Copilot

Pull Request Overview

This PR adds support for running multiple trials per test case in evaluations, allowing users to run each test case multiple times and average the scores across trials for more robust evaluation results.

Key changes:

Added trials parameter to EvalParams type to specify the number of times each test case should run (defaults to 1)
Introduced new semantic convention attribute ATTR_EVAL_TRIAL_INDEX for telemetry tracking
Refactored the evaluation execution logic to wrap test cases in a trial loop, compute averaged scores across trials, and report aggregated results

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File	Description
packages/ai/src/otel/semconv/eval_proposal.ts	Adds the `ATTR_EVAL_TRIAL_INDEX` constant for tracking trial indices in telemetry
packages/ai/src/otel/semconv/attributes.ts	Exports the trial index attribute through the `Attr.Eval.Trial.Index` path
packages/ai/src/evals/eval.types.ts	Adds the optional `trials` parameter to `EvalParams` type definition with documentation
packages/ai/src/evals/eval.ts	Implements the trial loop logic, including score averaging, error handling per trial, and result aggregation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-19T08:01:28Z

packages/ai/src/evals/eval.ts

+                    } catch (e) {
+                      console.log(e);
+                      const error = e as Error;

-                      return {
-                        name: scorerName,
-                        ...result,
-                        metadata: { duration, startedAt: start, error: null },
-                      };
-                    },
-                    caseContext,
-                  );
-                }),
-              );
+                      const ctx = getEvalContext();
+                      outOfScopeFlags = ctx.outOfScopeFlags || ([] as OutOfScopeFlagAccess[]);

-              const scores = Object.fromEntries(scoreList.map((s) => [s.name, s]));
+                      lastError = error;
+                    }


The error handling in the trial loop silently catches errors and continues to the next trial. If a trial fails, only lastError is recorded, but the loop continues. This means:

If some trials succeed and some fail, only successful trials are averaged (which may be intended)

If all trials fail, the error is handled in the else if (lastError) block

However, there's a potential issue: when a trial error is caught, allOutOfScopeFlags.push(...outOfScopeFlags) at line 561 won't be executed for that trial. This inconsistency should be addressed - either push the flags before the try block ends, or ensure they're collected even on error.

Copilot · 2025-11-19T08:01:28Z

packages/ai/src/evals/eval.ts

+                    typeof lastTrial.output === 'string'
+                      ? lastTrial.output
+                      : JSON.stringify(lastTrial.output),
+                  [Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}),


The ternary check averagedScores ? averagedScores : {} is redundant. The averagedScores object is always defined (initialized as an empty object at line 578), so this condition will always be truthy. Consider simplifying to JSON.stringify(averagedScores).

Suggested change

[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}),

[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores),

Copilot · 2025-11-19T08:01:28Z

packages/ai/src/evals/eval.ts

+                      scorerValues.length;
+
+                    averagedScores[scorerName] = {
+                      ...scorerValues[0],


When constructing the averaged score, spreading scorerValues[0] copies all properties including the metadata field. However, the metadata (containing duration, startedAt, error) is specific to the first trial and doesn't represent the averaged trials. Consider either:

Omitting metadata from averaged scores

Creating new metadata that reflects all trials (e.g., total duration, list of durations)

Explicitly selecting which fields to copy: { name: scorerValues[0].name, score: avgScore, metadata: scorerValues[0].metadata }

The current implementation may be misleading as the metadata doesn't correspond to the averaged score.

Suggested change

...scorerValues[0],

name: scorerValues[0].name,

Copilot · 2025-11-19T08:01:29Z

packages/ai/src/evals/eval.ts

+                      trialSpan.setAttributes({
+                        [Attr.Eval.Case.Output]:
+                          typeof output === 'string' ? output : JSON.stringify(output),
+                        [Attr.Eval.Case.Scores]: JSON.stringify(scores ? scores : {}),


The ternary check scores ? scores : {} is redundant. The scores object is always defined (created from the scoreList at line 551), so this condition will always be truthy. Consider simplifying to JSON.stringify(scores).

Suggested change

[Attr.Eval.Case.Scores]: JSON.stringify(scores ? scores : {}),

[Attr.Eval.Case.Scores]: JSON.stringify(scores),

Copilot · 2025-11-19T08:01:29Z

packages/ai/src/evals/eval.ts

+                  pickedFlags: opts.configFlags,
+                };
+              } else if (lastError) {
+                const error: Error = lastError;


The redundant assignment const error: Error = lastError; is unnecessary since lastError is already typed as Error | null and checked to be truthy at line 624. Consider using lastError directly instead of creating a new variable.

Copilot · 2025-11-19T08:01:29Z

packages/ai/src/evals/eval.ts

+                const failedScores: Record<string, ScoreWithName> = {};
+                for (const scorer of opts.scorers) {
+                  failedScores[scorer.name] = {
+                    name: scorer.name,
+                    score: 0,
+                    metadata: {
+                      duration: 0,
+                      startedAt: start,
+                      error: error.message,
+                    },
+                  };
+                }
+
+                task.meta.case = {
+                  name: evalName,
+                  index: data.index,
+                  expected: data.expected,
+                  input: data.input,
+                  output: String(error),
+                  scores: failedScores,
+                  status: 'fail',
+                  errors: [error],
+                  startedAt: start,
+                  duration: Math.round(performance.now() - start),
+                  outOfScopeFlags,
+                  pickedFlags: opts.configFlags,
+                };


This error handling block is nearly identical to the outer catch block at lines 661-687. The logic for creating failedScores and populating task.meta.case is duplicated. Consider extracting this into a helper function to reduce duplication and improve maintainability.

Copilot · 2025-11-19T08:01:30Z

packages/ai/src/evals/eval.ts

+                  failedScores[scorer.name] = {
+                    name: scorer.name,


Direct access to scorer.name may fail if the scorer doesn't have a name property. The code elsewhere uses getScorerName(scorer) helper function (defined at line 171) which provides a fallback. Consider using the same helper here for consistency and to avoid potential undefined values:

const scorerName = getScorerName(scorer); failedScores[scorerName] = { name: scorerName, ... };

Suggested change

failedScores[scorer.name] = {

name: scorer.name,

const scorerName = getScorerName(scorer);

failedScores[scorerName] = {

name: scorerName,

Copilot · 2025-11-19T08:01:30Z

packages/ai/src/evals/eval.ts

-              // Populate scores with error metadata for all scorers that didn't run
              const failedScores: Record<string, ScoreWithName> = {};
              for (const scorer of opts.scorers) {
                failedScores[scorer.name] = {


Direct access to scorer.name may fail if the scorer doesn't have a name property. The code elsewhere uses getScorerName(scorer) helper function (defined at line 171) which provides a fallback. Consider using the same helper here for consistency and to avoid potential undefined values:

const scorerName = getScorerName(scorer); failedScores[scorerName] = { name: scorerName, ... };

Copilot · 2025-11-19T08:01:30Z

packages/ai/src/evals/eval.ts

+              if (trialResults.length > 0) {
+                const scorerNames = Object.keys(trialResults[0].scores);
+                const averagedScores: Record<string, ScoreWithName> = {};

-              // set task meta for showing result in vitest report
-              task.meta.case = {
-                index: data.index,
-                name: evalName,
-                expected: data.expected,
-                input: data.input,
-                output: output,
-                scores,
-                status: 'success',
-                errors: [],
-                duration,
-                startedAt: start,
-                outOfScopeFlags,
-                pickedFlags: opts.configFlags,
-              };
+                for (const scorerName of scorerNames) {
+                  const scorerValues = trialResults
+                    .map((trial) => trial.scores[scorerName])
+                    .filter((s) => s !== undefined);

-              // Collect out-of-scope flags for evaluation-level aggregation
-              allOutOfScopeFlags.push(...outOfScopeFlags);
+                  if (scorerValues.length > 0) {
+                    const avgScore =
+                      scorerValues.reduce((sum, s) => sum + (s.score as number), 0) /
+                      scorerValues.length;
+
+                    averagedScores[scorerName] = {
+                      ...scorerValues[0],
+                      score: avgScore,
+                    };
+                  }
+                }
+
+                const avgDuration =
+                  trialResults.reduce((sum, t) => sum + t.duration, 0) / trialResults.length;
+
+                const lastTrial = trialResults[trialResults.length - 1];
+
+                caseSpan.setAttributes({
+                  [Attr.Eval.Case.Output]:
+                    typeof lastTrial.output === 'string'
+                      ? lastTrial.output
+                      : JSON.stringify(lastTrial.output),
+                  [Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}),
+                });
+
+                task.meta.case = {
+                  index: data.index,
+                  name: evalName,
+                  expected: data.expected,
+                  input: data.input,
+                  output: lastTrial.output,
+                  scores: averagedScores,
+                  status: 'success',
+                  errors: [],
+                  duration: Math.round(avgDuration),
+                  startedAt: start,
+                  outOfScopeFlags,
+                  pickedFlags: opts.configFlags,
+                };


When no trials complete successfully (trialResults.length === 0) but lastError is also null, no task.meta.case is set. This can happen if an unexpected error bypasses the trial error catching. Consider adding an else clause to handle this edge case or asserting that at least one of the conditions must be true.

Copilot · 2025-11-19T08:01:30Z

packages/ai/src/evals/eval.ts

-                      });
+                      allOutOfScopeFlags.push(...outOfScopeFlags);
+                    } catch (e) {
+                      console.log(e);


Using console.log for error logging is not ideal for production code. Consider using a proper logger or at minimum console.error for error cases. Additionally, the error is logged but then silently swallowed, which may make debugging difficult.

Suggested change

console.log(e);

console.error(e);

attempt to re-implement trials

72a5d68

Copilot AI review requested due to automatic review settings November 19, 2025 07:52

Copilot started reviewing on behalf of c-ehrlich November 19, 2025 07:52 View session

Copilot finished reviewing on behalf of c-ehrlich November 19, 2025 07:59

Copilot AI reviewed Nov 19, 2025

View reviewed changes

c-ehrlich marked this pull request as draft November 24, 2025 14:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: trials #140

feat: trials #140

c-ehrlich commented Nov 19, 2025

Uh oh!

pkg-pr-new bot commented Nov 19, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Copilot AI Nov 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}),
	[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores),

feat: trials #140

Are you sure you want to change the base?

feat: trials #140

Conversation

c-ehrlich commented Nov 19, 2025

Uh oh!

pkg-pr-new bot commented Nov 19, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant