Skip to content

Conversation

@c-ehrlich
Copy link
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings November 19, 2025 07:52
@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 19, 2025

Open in StackBlitz

npm i https://pkg.pr.new/axiomhq/ai/axiom@140

commit: 72a5d68

Copilot finished reviewing on behalf of c-ehrlich November 19, 2025 07:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for running multiple trials per test case in evaluations, allowing users to run each test case multiple times and average the scores across trials for more robust evaluation results.

Key changes:

  • Added trials parameter to EvalParams type to specify the number of times each test case should run (defaults to 1)
  • Introduced new semantic convention attribute ATTR_EVAL_TRIAL_INDEX for telemetry tracking
  • Refactored the evaluation execution logic to wrap test cases in a trial loop, compute averaged scores across trials, and report aggregated results

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

File Description
packages/ai/src/otel/semconv/eval_proposal.ts Adds the ATTR_EVAL_TRIAL_INDEX constant for tracking trial indices in telemetry
packages/ai/src/otel/semconv/attributes.ts Exports the trial index attribute through the Attr.Eval.Trial.Index path
packages/ai/src/evals/eval.types.ts Adds the optional trials parameter to EvalParams type definition with documentation
packages/ai/src/evals/eval.ts Implements the trial loop logic, including score averaging, error handling per trial, and result aggregation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +562 to +570
} catch (e) {
console.log(e);
const error = e as Error;

return {
name: scorerName,
...result,
metadata: { duration, startedAt: start, error: null },
};
},
caseContext,
);
}),
);
const ctx = getEvalContext();
outOfScopeFlags = ctx.outOfScopeFlags || ([] as OutOfScopeFlagAccess[]);

const scores = Object.fromEntries(scoreList.map((s) => [s.name, s]));
lastError = error;
}
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling in the trial loop silently catches errors and continues to the next trial. If a trial fails, only lastError is recorded, but the loop continues. This means:

  1. If some trials succeed and some fail, only successful trials are averaged (which may be intended)
  2. If all trials fail, the error is handled in the else if (lastError) block

However, there's a potential issue: when a trial error is caught, allOutOfScopeFlags.push(...outOfScopeFlags) at line 561 won't be executed for that trial. This inconsistency should be addressed - either push the flags before the try block ends, or ensure they're collected even on error.

Copilot uses AI. Check for mistakes.
typeof lastTrial.output === 'string'
? lastTrial.output
: JSON.stringify(lastTrial.output),
[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}),
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ternary check averagedScores ? averagedScores : {} is redundant. The averagedScores object is always defined (initialized as an empty object at line 578), so this condition will always be truthy. Consider simplifying to JSON.stringify(averagedScores).

Suggested change
[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}),
[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores),

Copilot uses AI. Check for mistakes.
scorerValues.length;

averagedScores[scorerName] = {
...scorerValues[0],
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When constructing the averaged score, spreading scorerValues[0] copies all properties including the metadata field. However, the metadata (containing duration, startedAt, error) is specific to the first trial and doesn't represent the averaged trials. Consider either:

  1. Omitting metadata from averaged scores
  2. Creating new metadata that reflects all trials (e.g., total duration, list of durations)
  3. Explicitly selecting which fields to copy: { name: scorerValues[0].name, score: avgScore, metadata: scorerValues[0].metadata }

The current implementation may be misleading as the metadata doesn't correspond to the averaged score.

Suggested change
...scorerValues[0],
name: scorerValues[0].name,

Copilot uses AI. Check for mistakes.
trialSpan.setAttributes({
[Attr.Eval.Case.Output]:
typeof output === 'string' ? output : JSON.stringify(output),
[Attr.Eval.Case.Scores]: JSON.stringify(scores ? scores : {}),
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ternary check scores ? scores : {} is redundant. The scores object is always defined (created from the scoreList at line 551), so this condition will always be truthy. Consider simplifying to JSON.stringify(scores).

Suggested change
[Attr.Eval.Case.Scores]: JSON.stringify(scores ? scores : {}),
[Attr.Eval.Case.Scores]: JSON.stringify(scores),

Copilot uses AI. Check for mistakes.
pickedFlags: opts.configFlags,
};
} else if (lastError) {
const error: Error = lastError;
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The redundant assignment const error: Error = lastError; is unnecessary since lastError is already typed as Error | null and checked to be truthy at line 624. Consider using lastError directly instead of creating a new variable.

Copilot uses AI. Check for mistakes.
Comment on lines +626 to +652
const failedScores: Record<string, ScoreWithName> = {};
for (const scorer of opts.scorers) {
failedScores[scorer.name] = {
name: scorer.name,
score: 0,
metadata: {
duration: 0,
startedAt: start,
error: error.message,
},
};
}

task.meta.case = {
name: evalName,
index: data.index,
expected: data.expected,
input: data.input,
output: String(error),
scores: failedScores,
status: 'fail',
errors: [error],
startedAt: start,
duration: Math.round(performance.now() - start),
outOfScopeFlags,
pickedFlags: opts.configFlags,
};
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error handling block is nearly identical to the outer catch block at lines 661-687. The logic for creating failedScores and populating task.meta.case is duplicated. Consider extracting this into a helper function to reduce duplication and improve maintainability.

Copilot uses AI. Check for mistakes.
Comment on lines +628 to +629
failedScores[scorer.name] = {
name: scorer.name,
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct access to scorer.name may fail if the scorer doesn't have a name property. The code elsewhere uses getScorerName(scorer) helper function (defined at line 171) which provides a fallback. Consider using the same helper here for consistency and to avoid potential undefined values:

const scorerName = getScorerName(scorer);
failedScores[scorerName] = { name: scorerName, ... };
Suggested change
failedScores[scorer.name] = {
name: scorer.name,
const scorerName = getScorerName(scorer);
failedScores[scorerName] = {
name: scorerName,

Copilot uses AI. Check for mistakes.
// Populate scores with error metadata for all scorers that didn't run
const failedScores: Record<string, ScoreWithName> = {};
for (const scorer of opts.scorers) {
failedScores[scorer.name] = {
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct access to scorer.name may fail if the scorer doesn't have a name property. The code elsewhere uses getScorerName(scorer) helper function (defined at line 171) which provides a fallback. Consider using the same helper here for consistency and to avoid potential undefined values:

const scorerName = getScorerName(scorer);
failedScores[scorerName] = { name: scorerName, ... };

Copilot uses AI. Check for mistakes.
Comment on lines +576 to +623
if (trialResults.length > 0) {
const scorerNames = Object.keys(trialResults[0].scores);
const averagedScores: Record<string, ScoreWithName> = {};

// set task meta for showing result in vitest report
task.meta.case = {
index: data.index,
name: evalName,
expected: data.expected,
input: data.input,
output: output,
scores,
status: 'success',
errors: [],
duration,
startedAt: start,
outOfScopeFlags,
pickedFlags: opts.configFlags,
};
for (const scorerName of scorerNames) {
const scorerValues = trialResults
.map((trial) => trial.scores[scorerName])
.filter((s) => s !== undefined);

// Collect out-of-scope flags for evaluation-level aggregation
allOutOfScopeFlags.push(...outOfScopeFlags);
if (scorerValues.length > 0) {
const avgScore =
scorerValues.reduce((sum, s) => sum + (s.score as number), 0) /
scorerValues.length;

averagedScores[scorerName] = {
...scorerValues[0],
score: avgScore,
};
}
}

const avgDuration =
trialResults.reduce((sum, t) => sum + t.duration, 0) / trialResults.length;

const lastTrial = trialResults[trialResults.length - 1];

caseSpan.setAttributes({
[Attr.Eval.Case.Output]:
typeof lastTrial.output === 'string'
? lastTrial.output
: JSON.stringify(lastTrial.output),
[Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}),
});

task.meta.case = {
index: data.index,
name: evalName,
expected: data.expected,
input: data.input,
output: lastTrial.output,
scores: averagedScores,
status: 'success',
errors: [],
duration: Math.round(avgDuration),
startedAt: start,
outOfScopeFlags,
pickedFlags: opts.configFlags,
};
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When no trials complete successfully (trialResults.length === 0) but lastError is also null, no task.meta.case is set. This can happen if an unexpected error bypasses the trial error catching. Consider adding an else clause to handle this edge case or asserting that at least one of the conditions must be true.

Copilot uses AI. Check for mistakes.
});
allOutOfScopeFlags.push(...outOfScopeFlags);
} catch (e) {
console.log(e);
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using console.log for error logging is not ideal for production code. Consider using a proper logger or at minimum console.error for error cases. Additionally, the error is logged but then silently swallowed, which may make debugging difficult.

Suggested change
console.log(e);
console.error(e);

Copilot uses AI. Check for mistakes.
@c-ehrlich c-ehrlich marked this pull request as draft November 24, 2025 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant