-
Notifications
You must be signed in to change notification settings - Fork 2
feat: trials #140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: trials #140
Conversation
commit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for running multiple trials per test case in evaluations, allowing users to run each test case multiple times and average the scores across trials for more robust evaluation results.
Key changes:
- Added
trialsparameter toEvalParamstype to specify the number of times each test case should run (defaults to 1) - Introduced new semantic convention attribute
ATTR_EVAL_TRIAL_INDEXfor telemetry tracking - Refactored the evaluation execution logic to wrap test cases in a trial loop, compute averaged scores across trials, and report aggregated results
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| packages/ai/src/otel/semconv/eval_proposal.ts | Adds the ATTR_EVAL_TRIAL_INDEX constant for tracking trial indices in telemetry |
| packages/ai/src/otel/semconv/attributes.ts | Exports the trial index attribute through the Attr.Eval.Trial.Index path |
| packages/ai/src/evals/eval.types.ts | Adds the optional trials parameter to EvalParams type definition with documentation |
| packages/ai/src/evals/eval.ts | Implements the trial loop logic, including score averaging, error handling per trial, and result aggregation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } catch (e) { | ||
| console.log(e); | ||
| const error = e as Error; | ||
|
|
||
| return { | ||
| name: scorerName, | ||
| ...result, | ||
| metadata: { duration, startedAt: start, error: null }, | ||
| }; | ||
| }, | ||
| caseContext, | ||
| ); | ||
| }), | ||
| ); | ||
| const ctx = getEvalContext(); | ||
| outOfScopeFlags = ctx.outOfScopeFlags || ([] as OutOfScopeFlagAccess[]); | ||
|
|
||
| const scores = Object.fromEntries(scoreList.map((s) => [s.name, s])); | ||
| lastError = error; | ||
| } |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling in the trial loop silently catches errors and continues to the next trial. If a trial fails, only lastError is recorded, but the loop continues. This means:
- If some trials succeed and some fail, only successful trials are averaged (which may be intended)
- If all trials fail, the error is handled in the
else if (lastError)block
However, there's a potential issue: when a trial error is caught, allOutOfScopeFlags.push(...outOfScopeFlags) at line 561 won't be executed for that trial. This inconsistency should be addressed - either push the flags before the try block ends, or ensure they're collected even on error.
| typeof lastTrial.output === 'string' | ||
| ? lastTrial.output | ||
| : JSON.stringify(lastTrial.output), | ||
| [Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}), |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ternary check averagedScores ? averagedScores : {} is redundant. The averagedScores object is always defined (initialized as an empty object at line 578), so this condition will always be truthy. Consider simplifying to JSON.stringify(averagedScores).
| [Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}), | |
| [Attr.Eval.Case.Scores]: JSON.stringify(averagedScores), |
| scorerValues.length; | ||
|
|
||
| averagedScores[scorerName] = { | ||
| ...scorerValues[0], |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When constructing the averaged score, spreading scorerValues[0] copies all properties including the metadata field. However, the metadata (containing duration, startedAt, error) is specific to the first trial and doesn't represent the averaged trials. Consider either:
- Omitting metadata from averaged scores
- Creating new metadata that reflects all trials (e.g., total duration, list of durations)
- Explicitly selecting which fields to copy:
{ name: scorerValues[0].name, score: avgScore, metadata: scorerValues[0].metadata }
The current implementation may be misleading as the metadata doesn't correspond to the averaged score.
| ...scorerValues[0], | |
| name: scorerValues[0].name, |
| trialSpan.setAttributes({ | ||
| [Attr.Eval.Case.Output]: | ||
| typeof output === 'string' ? output : JSON.stringify(output), | ||
| [Attr.Eval.Case.Scores]: JSON.stringify(scores ? scores : {}), |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ternary check scores ? scores : {} is redundant. The scores object is always defined (created from the scoreList at line 551), so this condition will always be truthy. Consider simplifying to JSON.stringify(scores).
| [Attr.Eval.Case.Scores]: JSON.stringify(scores ? scores : {}), | |
| [Attr.Eval.Case.Scores]: JSON.stringify(scores), |
| pickedFlags: opts.configFlags, | ||
| }; | ||
| } else if (lastError) { | ||
| const error: Error = lastError; |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The redundant assignment const error: Error = lastError; is unnecessary since lastError is already typed as Error | null and checked to be truthy at line 624. Consider using lastError directly instead of creating a new variable.
| const failedScores: Record<string, ScoreWithName> = {}; | ||
| for (const scorer of opts.scorers) { | ||
| failedScores[scorer.name] = { | ||
| name: scorer.name, | ||
| score: 0, | ||
| metadata: { | ||
| duration: 0, | ||
| startedAt: start, | ||
| error: error.message, | ||
| }, | ||
| }; | ||
| } | ||
|
|
||
| task.meta.case = { | ||
| name: evalName, | ||
| index: data.index, | ||
| expected: data.expected, | ||
| input: data.input, | ||
| output: String(error), | ||
| scores: failedScores, | ||
| status: 'fail', | ||
| errors: [error], | ||
| startedAt: start, | ||
| duration: Math.round(performance.now() - start), | ||
| outOfScopeFlags, | ||
| pickedFlags: opts.configFlags, | ||
| }; |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error handling block is nearly identical to the outer catch block at lines 661-687. The logic for creating failedScores and populating task.meta.case is duplicated. Consider extracting this into a helper function to reduce duplication and improve maintainability.
| failedScores[scorer.name] = { | ||
| name: scorer.name, |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Direct access to scorer.name may fail if the scorer doesn't have a name property. The code elsewhere uses getScorerName(scorer) helper function (defined at line 171) which provides a fallback. Consider using the same helper here for consistency and to avoid potential undefined values:
const scorerName = getScorerName(scorer);
failedScores[scorerName] = { name: scorerName, ... };| failedScores[scorer.name] = { | |
| name: scorer.name, | |
| const scorerName = getScorerName(scorer); | |
| failedScores[scorerName] = { | |
| name: scorerName, |
| // Populate scores with error metadata for all scorers that didn't run | ||
| const failedScores: Record<string, ScoreWithName> = {}; | ||
| for (const scorer of opts.scorers) { | ||
| failedScores[scorer.name] = { |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Direct access to scorer.name may fail if the scorer doesn't have a name property. The code elsewhere uses getScorerName(scorer) helper function (defined at line 171) which provides a fallback. Consider using the same helper here for consistency and to avoid potential undefined values:
const scorerName = getScorerName(scorer);
failedScores[scorerName] = { name: scorerName, ... };| if (trialResults.length > 0) { | ||
| const scorerNames = Object.keys(trialResults[0].scores); | ||
| const averagedScores: Record<string, ScoreWithName> = {}; | ||
|
|
||
| // set task meta for showing result in vitest report | ||
| task.meta.case = { | ||
| index: data.index, | ||
| name: evalName, | ||
| expected: data.expected, | ||
| input: data.input, | ||
| output: output, | ||
| scores, | ||
| status: 'success', | ||
| errors: [], | ||
| duration, | ||
| startedAt: start, | ||
| outOfScopeFlags, | ||
| pickedFlags: opts.configFlags, | ||
| }; | ||
| for (const scorerName of scorerNames) { | ||
| const scorerValues = trialResults | ||
| .map((trial) => trial.scores[scorerName]) | ||
| .filter((s) => s !== undefined); | ||
|
|
||
| // Collect out-of-scope flags for evaluation-level aggregation | ||
| allOutOfScopeFlags.push(...outOfScopeFlags); | ||
| if (scorerValues.length > 0) { | ||
| const avgScore = | ||
| scorerValues.reduce((sum, s) => sum + (s.score as number), 0) / | ||
| scorerValues.length; | ||
|
|
||
| averagedScores[scorerName] = { | ||
| ...scorerValues[0], | ||
| score: avgScore, | ||
| }; | ||
| } | ||
| } | ||
|
|
||
| const avgDuration = | ||
| trialResults.reduce((sum, t) => sum + t.duration, 0) / trialResults.length; | ||
|
|
||
| const lastTrial = trialResults[trialResults.length - 1]; | ||
|
|
||
| caseSpan.setAttributes({ | ||
| [Attr.Eval.Case.Output]: | ||
| typeof lastTrial.output === 'string' | ||
| ? lastTrial.output | ||
| : JSON.stringify(lastTrial.output), | ||
| [Attr.Eval.Case.Scores]: JSON.stringify(averagedScores ? averagedScores : {}), | ||
| }); | ||
|
|
||
| task.meta.case = { | ||
| index: data.index, | ||
| name: evalName, | ||
| expected: data.expected, | ||
| input: data.input, | ||
| output: lastTrial.output, | ||
| scores: averagedScores, | ||
| status: 'success', | ||
| errors: [], | ||
| duration: Math.round(avgDuration), | ||
| startedAt: start, | ||
| outOfScopeFlags, | ||
| pickedFlags: opts.configFlags, | ||
| }; |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When no trials complete successfully (trialResults.length === 0) but lastError is also null, no task.meta.case is set. This can happen if an unexpected error bypasses the trial error catching. Consider adding an else clause to handle this edge case or asserting that at least one of the conditions must be true.
| }); | ||
| allOutOfScopeFlags.push(...outOfScopeFlags); | ||
| } catch (e) { | ||
| console.log(e); |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using console.log for error logging is not ideal for production code. Consider using a proper logger or at minimum console.error for error cases. Additionally, the error is logged but then silently swallowed, which may make debugging difficult.
| console.log(e); | |
| console.error(e); |
No description provided.