Add ADR-0020: Foundry Evals integration design #4731
alliscode wants to merge 1 commit into microsoft:main
Conversation
Pull request overview
Adds a new Architectural Decision Record (ADR) documenting the proposed architecture for integrating Azure AI Foundry Evaluations with the agent-framework, including shared concepts and intended Python/.NET API shapes.
Changes:
- Introduces ADR text describing the evaluator protocol + orchestration approach, including EvalItem/split strategies and Foundry/local evaluators.
- Documents intended cross-language (Python/.NET) API parity and .NET MEAI (Microsoft.Extensions.AI.Evaluation) alignment.
- Adds usage examples for agent/workflow evaluation and mixed evaluator providers.
```python
agent=agent,
queries=["Plan a 3-day trip to Paris"],
evaluators=evals,
conversation_split=ConversationSplit.FULL,  # evaluate entire trajectory
```
Are these universal across different evals?

Yes. The basic eval protocol is `(query, response) -> bool | float`. The split strategies are at the MAF layer and do the splitting before the evaluators run.
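To make this concrete, here is a minimal sketch of how a split strategy could derive `(query, response)` pairs from a conversation before any evaluator runs. Everything beyond the `ConversationSplit.FULL` name from the snippet above (the `PER_TURN` variant, the `Message` shape, the pairing rule) is a hypothetical illustration, not the ADR's implementation:

```python
from dataclasses import dataclass
from enum import Enum

class ConversationSplit(Enum):
    FULL = "full"          # one item covering the entire trajectory
    PER_TURN = "per_turn"  # one item per user/assistant exchange (assumed variant)

@dataclass
class Message:
    role: str  # "user" or "assistant"
    text: str

def split_conversation(
    conversation: list[Message], split: ConversationSplit
) -> list[tuple[str, str]]:
    """Derive (query, response) pairs; evaluators only ever see these pairs."""
    if split is ConversationSplit.FULL:
        query = "\n".join(m.text for m in conversation if m.role == "user")
        response = "\n".join(m.text for m in conversation if m.role == "assistant")
        return [(query, response)]
    # PER_TURN: pair each user message with the assistant reply that follows it
    pairs = []
    for prev, nxt in zip(conversation, conversation[1:]):
        if prev.role == "user" and nxt.role == "assistant":
            pairs.append((prev.text, nxt.text))
    return pairs
```

Because the splitting happens here, an individual evaluator never needs to know which strategy produced its input.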
```python
@function_evaluator
def mentions_city(response: str, expected: str) -> bool:
```
What is the expected signature for function evaluators? I see two different ones here.

Added more info about the signature below:

@evaluator uses parameter name injection: the function's parameter names determine what data it receives from the EvalItem. Supported names: query, response, expected, conversation, tools, context. Any combination is valid.
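Parameter name injection can be sketched in a few lines with `inspect.signature`. The `EVAL_ITEM` dict and `run_function_evaluator` helper below are hypothetical stand-ins for the real EvalItem and decorator machinery, shown only to illustrate why any combination of the supported names works:

```python
import inspect

# Hypothetical EvalItem fields, mirroring the supported parameter names.
EVAL_ITEM = {
    "query": "What's the weather?",
    "response": "It is sunny in Paris.",
    "expected": "sunny",
    "conversation": [],
    "tools": [],
    "context": None,
}

def run_function_evaluator(fn, item: dict):
    """Inject only the EvalItem fields named in fn's signature."""
    names = inspect.signature(fn).parameters
    unknown = set(names) - set(item)
    if unknown:
        raise TypeError(f"unsupported evaluator parameters: {unknown}")
    return fn(**{name: item[name] for name in names})

# Two evaluators with different signatures; both are valid because each
# parameter name maps to an EvalItem field.
def mentions_expected(response: str, expected: str) -> bool:
    return expected.lower() in response.lower()

def has_query(query: str) -> bool:
    return bool(query)
```

So the two signatures seen in the ADR are not inconsistent; they are two points in the same space of allowed parameter combinations.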
```python
# Tier 3: Full context — inspect conversation and tools
@function_evaluator
def used_tools(conversation: list, tools: list) -> float:
```
Could we demonstrate how GAIA would fit into this? It has a lot of knobs, which is probably useful, so it would be good to understand whether the proposed abstraction can accommodate all of those settings.
```python
# results[0] (FoundryEvals)
EvalResults(status="completed", passed=1, failed=0, total=1)
items[0]: EvalItemResult(query="What's the weather?", scores={"relevance": 5, "coherence": 5})
```
Does an item include the agent response and tool usage for auditing purposes?

Currently it does not. Do you think it needs to?

I feel like there should be, just for completeness.
```python
results = await evaluate_agent(
    agent=my_agent,
```
Does this invoke the agent to get responses for evaluation? If so, how many times will it invoke the agent?

Yes, it does if you provide queries. It will invoke the agent once per query.

A single invocation sounds less convincing. If the agent is invoked only once, the results can hardly be a performance indicator for it. Would it make more sense for the agent to be invoked, say, 10 times, with the results indicating that it performed well in 9 of them (just a random number)?
3. Manually wire up the correct Foundry data source type (`azure_ai_traces`, `jsonl`, `azure_ai_target_completions`, etc.) depending on their scenario
4. Handle App Insights trace ID queries, response ID collection, and eval polling

Additionally, evaluation is a concern that extends beyond any single provider. Developers may want to use local evaluators (LLM-as-judge, regex, keyword matching), third-party evaluation libraries, or multiple providers in combination. The architecture must support this without creating Foundry-specific lock-in at the API level.
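As a concrete illustration of mixing evaluator kinds, here is a self-contained sketch of regex and keyword checks sharing one `(query, response) -> bool` protocol, which is what lets heterogeneous providers coexist behind one API. The names (`regex_check`, `keyword_check`, `run_all`) are hypothetical and not part of the proposed API:

```python
import re
from typing import Callable

Check = Callable[[str, str], bool]  # shared (query, response) -> bool protocol

def regex_check(pattern: str) -> Check:
    """Local evaluator: pass if the response matches a regex."""
    compiled = re.compile(pattern)
    def check(query: str, response: str) -> bool:
        return bool(compiled.search(response))
    return check

def keyword_check(*keywords: str) -> Check:
    """Local evaluator: pass if every keyword appears in the response."""
    def check(query: str, response: str) -> bool:
        return all(k.lower() in response.lower() for k in keywords)
    return check

def run_all(evaluators: dict[str, Check], query: str, response: str) -> dict[str, bool]:
    """Run a mixed bag of evaluators and collect named scores."""
    return {name: fn(query, response) for name, fn in evaluators.items()}
```

A provider-backed evaluator (Foundry, a third-party library) would just be another entry in the dict, as long as it honors the same protocol.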
Is there a list of third-party libraries to support?
`EvalItem` is a lightweight record used only by `FunctionEvaluator` and `LocalEvaluator` to pass context to check functions. It is not part of the `IEvaluator` interface:
Are multi-modal responses to be supported for evaluation?
Captures the design for integrating Azure AI Foundry Evaluations with agent-framework. Key decisions:
- EvalItem with conversation (list[Message]) as the single source of truth
- query/response derived from configurable conversation split strategies
- Tools as list[FunctionTool] (including auto-extracted MCP tools)
- FoundryEvals provider with auto-detection of evaluator capabilities
- LocalEvaluator with @function_evaluator decorator for local checks
- Consistent Python/C# APIs: evaluate_agent, evaluate_workflow

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>