
Add ADR-0020: Foundry Evals integration design#4731

Open
alliscode wants to merge 1 commit into microsoft:main from alliscode:af-foundry-evals-adr

Conversation

@alliscode
Member

Captures the design for integrating Azure AI Foundry Evaluations with agent-framework. Key decisions:

  • EvalItem with conversation (list[Message]) as single source of truth
  • query/response derived from configurable conversation split strategies
  • Tools as list[FunctionTool] (including auto-extracted MCP tools)
  • FoundryEvals provider with auto-detection of evaluator capabilities
  • LocalEvaluator with @function_evaluator decorator for local checks
  • Consistent Python/C# APIs: evaluate_agent, evaluate_workflow
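The decisions above imply a data model in which the conversation is the single source of truth and query/response pairs are derived by a split strategy. The following is an illustrative sketch, not the framework's actual code: the names `EvalItem`, `Message`, and `ConversationSplit` come from this PR description, but every signature and field here is assumed.

```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class Message:
    role: str   # "user" or "assistant"
    text: str

class ConversationSplit(Enum):
    LAST_TURN = "last_turn"  # query = last user turn, response = last assistant turn
    FULL = "full"            # query/response cover the whole trajectory

@dataclass
class EvalItem:
    # The conversation is the single source of truth; query/response are derived.
    conversation: list

    def split(self, strategy: ConversationSplit) -> tuple:
        if strategy is ConversationSplit.LAST_TURN:
            query = next(m.text for m in reversed(self.conversation) if m.role == "user")
            response = next(m.text for m in reversed(self.conversation) if m.role == "assistant")
        else:  # FULL: serialize the entire trajectory
            query = "\n".join(m.text for m in self.conversation if m.role == "user")
            response = "\n".join(m.text for m in self.conversation if m.role == "assistant")
        return query, response

item = EvalItem(conversation=[
    Message("user", "Plan a 3-day trip to Paris"),
    Message("assistant", "Day 1: the Louvre. Day 2: Versailles. Day 3: Montmartre."),
])
query, response = item.split(ConversationSplit.LAST_TURN)
```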

Motivation and Context

Description

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Copilot AI review requested due to automatic review settings March 16, 2026 18:47
@markwallace-microsoft markwallace-microsoft added the documentation Improvements or additions to documentation label Mar 16, 2026
Copilot AI (Contributor) left a comment:

Pull request overview

Adds a new Architectural Decision Record (ADR) documenting the proposed architecture for integrating Azure AI Foundry Evaluations with the agent-framework, including shared concepts and intended Python/.NET API shapes.

Changes:

  • Introduces ADR text describing evaluator protocol + orchestration approach, including EvalItem/split strategies and Foundry/local evaluators.
  • Documents intended cross-language (Python/.NET) API parity and .NET MEAI (Microsoft.Extensions.AI.Evaluation) alignment.
  • Adds usage examples for agent/workflow evaluation and mixed evaluator providers.


@alliscode alliscode force-pushed the af-foundry-evals-adr branch from 1cd0d6b to 2fecc22 Compare March 16, 2026 21:17
@alliscode alliscode force-pushed the af-foundry-evals-adr branch 2 times, most recently from 39201a4 to deed5e4 Compare March 17, 2026 18:10
```python
agent=agent,
queries=["Plan a 3-day trip to Paris"],
evaluators=evals,
conversation_split=ConversationSplit.FULL,  # evaluate entire trajectory
```
Member:

Are these universal across different evals?

Member Author:

Yes. The basic eval protocol is `(query, response) -> bool | float`. The split strategies live at the MAF layer and do the splitting before the evaluators run.
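The contract described in this reply can be sketched as follows. This is illustrative only: the evaluator signature comes from the reply above, while `run_evals` and the sample evaluators are assumed stand-ins for the framework's orchestration layer.

```python
from typing import Callable, Union

# The basic evaluator contract described above: (query, response) -> bool | float
Evaluator = Callable[[str, str], Union[bool, float]]

def contains_keyword(query: str, response: str) -> bool:
    return "Louvre" in response

def brevity_score(query: str, response: str) -> float:
    # Shorter answers score higher; clamp to the [0, 1] range.
    return max(0.0, 1.0 - len(response) / 500)

def run_evals(query: str, response: str, evaluators: dict):
    """The framework layer has already split the conversation into a
    (query, response) pair; evaluators never see the raw trajectory."""
    return {name: ev(query, response) for name, ev in evaluators.items()}

scores = run_evals(
    "Plan a trip to Paris",
    "Day 1: visit the Louvre. Day 2: Eiffel Tower.",
    {"contains_keyword": contains_keyword, "brevity": brevity_score},
)
```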


```python
@function_evaluator
def mentions_city(response: str, expected: str) -> bool:
    ...
```
Member:

What is the expected signature for function evaluators? I see two different ones here.

Member Author:

Added more info about the signature below:

@function_evaluator uses parameter-name injection — the function's parameter names determine what data it receives from the EvalItem. Supported names: query, response, expected, conversation, tools, context. Any combination is valid.
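Parameter-name injection of this kind can be sketched with `inspect.signature`. The decorator name and the supported field names come from the reply above; the `EvalItem` dataclass and the wrapper mechanics here are assumptions for illustration, not the framework's implementation.

```python
import inspect
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    query: str = ""
    response: str = ""
    expected: str = ""
    conversation: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    context: str = ""

def function_evaluator(fn):
    """Read the wrapped function's parameter names and inject the matching
    EvalItem fields; any combination of the supported names is valid."""
    wanted = list(inspect.signature(fn).parameters)
    def run(item: EvalItem):
        return fn(**{name: getattr(item, name) for name in wanted})
    run.__name__ = fn.__name__
    return run

@function_evaluator
def mentions_city(response: str, expected: str) -> bool:
    return expected.lower() in response.lower()

@function_evaluator
def used_tools(conversation: list, tools: list) -> float:
    return 1.0 if tools else 0.0

ok = mentions_city(EvalItem(response="Paris is lovely in spring.", expected="Paris"))
```

Both example evaluators above have different signatures, which is the point of the question in this thread: the decorator adapts to whichever subset of fields the function names.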


```python
# Tier 3: Full context — inspect conversation and tools
@function_evaluator
def used_tools(conversation: list, tools: list) -> float:
    ...
```
Member:

Could we demonstrate how GAIA would fit into this? It has a lot of knobs, which is probably useful, so it would be good to understand whether we can accommodate all of those settings with the proposed abstraction.

Member Author:

Done

@alliscode alliscode force-pushed the af-foundry-evals-adr branch 4 times, most recently from 2d1b3f1 to 93ef327 Compare March 17, 2026 20:03
```
# results[0] (FoundryEvals)
EvalResults(status="completed", passed=1, failed=0, total=1)
items[0]: EvalItemResult(query="What's the weather?", scores={"relevance": 5, "coherence": 5})
```
Contributor:

Does an item include the agent response and tool usage for auditing purposes?

Member Author:

Currently it does not. Do you think it needs to?

Contributor:

I feel like there should, just for completeness.
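For illustration, an audit-friendly result item along the lines this thread suggests might look like the following. This is a hypothetical extension: `response` and `tool_calls` are not part of the current design, per the author's reply above.

```python
from dataclasses import dataclass, field

@dataclass
class EvalItemResult:
    query: str
    scores: dict
    # Hypothetical additions for auditing, per the discussion above:
    response: str = ""
    tool_calls: list = field(default_factory=list)

item = EvalItemResult(
    query="What's the weather?",
    scores={"relevance": 5, "coherence": 5},
    response="It is sunny and 22 degrees in Seattle.",
    tool_calls=[{"name": "get_weather", "arguments": {"city": "Seattle"}}],
)
```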

```python
)

results = await evaluate_agent(
    agent=my_agent,
```
Contributor:

Does this invoke the agent to get responses for evaluation? If so, how many times will it invoke the agent?

Member Author:

Yes, it does if you provide queries. It will invoke the agent once per query.

Contributor:

A single invocation seems less convincing: if the agent is only invoked once per query, the results can hardly serve as a performance indicator for the agent. Would it make more sense to invoke the agent 10 times and report that it performed well in, say, 9 of them (just a random number)?
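The reviewer's suggestion could be accommodated with a repeat-runs knob along these lines. This is a hypothetical sketch, not part of the ADR; `evaluate_with_repeats` and `runs_per_query` are invented names for illustration.

```python
def evaluate_with_repeats(agent, query, evaluator, runs_per_query=10):
    """Invoke the agent several times per query so the result reflects a
    pass rate rather than a single sample."""
    passes = sum(1 for _ in range(runs_per_query) if evaluator(query, agent(query)))
    return {"runs": runs_per_query, "passed": passes, "pass_rate": passes / runs_per_query}

# Toy deterministic agent for the demo; a real agent would be stochastic,
# which is exactly why repeated runs matter.
toy_agent = lambda q: "The Louvre is in Paris."
report = evaluate_with_repeats(
    toy_agent, "Where is the Louvre?", lambda q, r: "Paris" in r, runs_per_query=10
)
```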

@alliscode alliscode changed the title Add ADR-0018: Foundry Evals integration design Add ADR-0020: Foundry Evals integration design Mar 18, 2026
@alliscode alliscode force-pushed the af-foundry-evals-adr branch 2 times, most recently from 7e6c4f6 to 2e6fa21 Compare March 18, 2026 23:03
3. Manually wire up the correct Foundry data source type (`azure_ai_traces`, `jsonl`, `azure_ai_target_completions`, etc.) depending on their scenario
4. Handle App Insights trace ID queries, response ID collection, and eval polling

Additionally, evaluation is a concern that extends beyond any single provider. Developers may want to use local evaluators (LLM-as-judge, regex, keyword matching), third-party evaluation libraries, or multiple providers in combination. The architecture must support this without creating a Foundry-specific lock-in at the API level.
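The no-lock-in requirement above suggests a minimal provider-agnostic surface that Foundry-backed, local, and third-party evaluators can all implement. A sketch under assumed names (`EvalProvider`, `run_all`, and the two sample providers are all illustrative, not the framework's API):

```python
from typing import Protocol

class EvalProvider(Protocol):
    """Minimal provider-agnostic surface: any backend just scores a batch
    of items and returns named scores per item."""
    def evaluate(self, items: list) -> list: ...

class KeywordProvider:
    def __init__(self, keyword: str):
        self.keyword = keyword
    def evaluate(self, items):
        return [{"keyword_match": self.keyword in it["response"]} for it in items]

class LengthProvider:
    def evaluate(self, items):
        return [{"length": len(it["response"])} for it in items]

def run_all(providers, items):
    # Merge scores from every provider so callers are not tied to one backend.
    merged = [dict() for _ in items]
    for provider in providers:
        for out, scores in zip(merged, provider.evaluate(items)):
            out.update(scores)
    return merged

results = run_all(
    [KeywordProvider("Paris"), LengthProvider()],
    [{"response": "Paris is the capital of France."}],
)
```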
Contributor:

Is there a list of third-party libraries to support?

@alliscode alliscode force-pushed the af-foundry-evals-adr branch from 2e6fa21 to 1502d83 Compare March 18, 2026 23:30

`EvalItem` is a lightweight record used only by `FunctionEvaluator` and `LocalEvaluator` to pass context to check functions. It is not part of the `IEvaluator` interface:

```csharp
// (record definition elided in this excerpt)
```
Contributor:

Are multi-modal responses to be supported for evaluation?

@alliscode alliscode force-pushed the af-foundry-evals-adr branch from 1502d83 to 7ff0361 Compare March 18, 2026 23:38
Captures the design for integrating Azure AI Foundry Evaluations with
agent-framework. Key decisions:

- EvalItem with conversation (list[Message]) as single source of truth
- query/response derived from configurable conversation split strategies
- Tools as list[FunctionTool] (including auto-extracted MCP tools)
- FoundryEvals provider with auto-detection of evaluator capabilities
- LocalEvaluator with @function_evaluator decorator for local checks
- Consistent Python/C# APIs: evaluate_agent, evaluate_workflow

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@alliscode alliscode force-pushed the af-foundry-evals-adr branch from 7ff0361 to aeae780 Compare March 18, 2026 23:41

Labels

documentation Improvements or additions to documentation


7 participants