feat: better eval reporting #93
base: main
commit: |
* remove NodeSDK from instrumentation setups
* uninstall NodeSDK
* update docs
* better tools
Pull Request Overview
This PR enhances evaluation reporting capabilities with improved end-of-run reports, simplified mid-run reporting to prevent log mixing, better baseline handling, and fixes for runner issues.
- Adds comprehensive final report generation with scorecard-style output and baseline comparisons
- Implements per-suite baseline handling instead of global baseline storage
- Introduces flag configuration tracking and comparison with baselines
- Refactors timeout configuration to be environment-variable driven
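As a rough illustration of environment-variable-driven timeout configuration, the sketch below reads a timeout from an environment map and falls back to a default when the value is missing or invalid. The variable name `EVAL_TIMEOUT_MS`, the default of 60 seconds, and the function name are assumptions made for this example; the PR's actual variable and default in `run-vitest.ts` are not shown on this page.

```typescript
// Hypothetical sketch: the env var name and default are illustrative only.
type Env = Record<string, string | undefined>;

function resolveTimeoutMs(env: Env, fallbackMs = 60_000): number {
  const raw = env['EVAL_TIMEOUT_MS'];
  // Treat an unset or blank variable as "use the default".
  if (raw === undefined || raw.trim() === '') {
    return fallbackMs;
  }
  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, and negative values.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}
```

In a Node process this could be called as `resolveTimeoutMs(process.env)`; taking the map as a parameter keeps the function easy to test.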
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
packages/ai/src/util/dot-path.ts | Adds flattenObject utility for converting nested objects to dot notation |
packages/ai/src/evals/run-vitest.ts | Updates timeout configuration to use environment variable |
packages/ai/src/evals/reporter.ts | Major refactor to support final report generation and per-suite baseline handling |
packages/ai/src/evals/reporter.console-utils.ts | Adds comprehensive reporting utilities including final report formatting |
packages/ai/src/evals/instrument.ts | Fixes instrumentation re-initialization for vitest worker processes |
packages/ai/src/evals/eval.types.ts | Adds new types for flag diffs and out-of-scope flags |
packages/ai/src/evals/eval.ts | Enhances flag configuration capture and improves data collection |
packages/ai/src/evals/eval.service.ts | Improves baseline query and adds flag config mapping |
examples/example-evals-nextjs/test/feature.eval.ts | Updates example evaluation with new names and unused flag |
examples/example-evals-nextjs/src/lib/capabilities/classify-ticket/evaluations/ticket-classification.eval.ts | Adds additional test case for spam classification |
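The summary describes `flattenObject` as converting nested objects to dot notation. The following is a minimal sketch of that idea, not the code from `packages/ai/src/util/dot-path.ts`, which is not shown here and may differ. It assumes plain objects are recursed into, while arrays, empty objects, and primitives are kept as leaf values; the name `flattenObjectSketch` is hypothetical.

```typescript
// Hypothetical sketch of dot-notation flattening; not the PR's implementation.
type Flat = Record<string, unknown>;

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function flattenObjectSketch(input: Record<string, unknown>, prefix = ''): Flat {
  const out: Flat = {};
  for (const [key, value] of Object.entries(input)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (isPlainObject(value) && Object.keys(value).length > 0) {
      // Recurse into nested objects, extending the dot path.
      Object.assign(out, flattenObjectSketch(value, path));
    } else {
      // Primitives, arrays, and empty objects are stored as leaves.
      out[path] = value;
    }
  }
  return out;
}
```

Flattened keys like these make it straightforward to compare a run's flag configuration against a baseline key by key.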
printTestCaseSuccessOrFailed,
type SuiteData,
} from './reporter.console-utils';
import { flattenObject } from 'src/util/dot-path';
Import path should use relative path instead of 'src/' alias for consistency with other imports in the file.
Suggested change:
- import { flattenObject } from 'src/util/dot-path';
+ import { flattenObject } from '../util/dot-path';
for (const flag of suite.outOfScopeFlags) {
  const lastStackTraceFrame = flag.stackTrace[0];
  const lastStackTraceFnName = lastStackTraceFrame.split(' ').shift();
  const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.slice(0, -1);
Magic number -1 in slice() is unclear. Consider adding a comment explaining that it removes the closing parenthesis, or use a more descriptive approach.
Suggested change:
- const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.slice(0, -1);
+ // Remove trailing closing parenthesis from file name, if present
+ const lastStackTraceFile = lastStackTraceFrame.split('/').pop()?.replace(/\)$/, '');
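A slightly more explicit alternative is to parse the frame once and derive both the function name and the file from it. The sketch below assumes V8-style frames such as `at fnName (/path/to/file.ts:10:5)` or `at /path/to/file.ts:10:5`; the exact format stored in `flag.stackTrace` is not shown in this PR excerpt, so both that format and the helper name `parseFrameSketch` are assumptions.

```typescript
// Hypothetical helper; assumes V8-style stack frames.
interface ParsedFrame {
  fnName: string | undefined;
  file: string | undefined;
}

function parseFrameSketch(frame: string): ParsedFrame {
  // Drop surrounding whitespace and a leading "at ", if present.
  const trimmed = frame.trim().replace(/^at\s+/, '');
  // Match "fnName (location)"; anonymous frames have no parentheses.
  const withFn = /^(.+?)\s+\((.+)\)$/.exec(trimmed);
  const fnName = withFn ? withFn[1] : undefined;
  const location = withFn ? withFn[2] : trimmed;
  // Keep only the last path segment, e.g. "file.ts:10:5".
  const file = location?.split('/').pop();
  return { fnName, file };
}
```

Because the parentheses are consumed by the regular expression, no `slice(0, -1)` or trailing-parenthesis cleanup is needed afterwards.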
[Attr.GenAI.Operation.Name]: 'eval.case',
[Attr.Eval.ID]: evalId,
[Attr.Eval.Name]: evalName,
[Attr.Eval.Version]: evalName,
Suggested change:
- [Attr.Eval.Version]: evalName,
+ [Attr.Eval.Version]: evalVersion,
This should be the eval version, right?
[Attr.GenAI.Operation.Name]: 'eval.score',
[Attr.Eval.ID]: evalId,
[Attr.Eval.Name]: evalName,
[Attr.Eval.Version]: evalName,
Suggested change:
- [Attr.Eval.Version]: evalName,
+ [Attr.Eval.Version]: evalVersion,
In this PR