Add ADR-0020: Foundry Evals integration design #4731
alliscode wants to merge 1 commit into microsoft:main
Conversation
Pull request overview
Adds a new Architectural Decision Record (ADR) documenting the proposed architecture for integrating Azure AI Foundry Evaluations with the agent-framework, including shared concepts and intended Python/.NET API shapes.
Changes:
- Introduces ADR text describing the evaluator protocol + orchestration approach, including EvalItem/split strategies and Foundry/local evaluators.
- Documents intended cross-language (Python/.NET) API parity and .NET MEAI (Microsoft.Extensions.AI.Evaluation) alignment.
- Adds usage examples for agent/workflow evaluation and mixed evaluator providers.
```python
agent=agent,
queries=["Plan a 3-day trip to Paris"],
evaluators=evals,
conversation_split=ConversationSplit.FULL,  # evaluate entire trajectory
```
Are these universal across different evals?

Yes. The basic eval protocol is `(query, response) -> bool | float`. The split strategies are at the MAF layer and do the splitting before the evaluators run.
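To make this concrete, here is a minimal sketch of how a split strategy could derive `(query, response)` pairs from a conversation before any evaluator runs. Everything beyond the `ConversationSplit.FULL` name from the snippet above (the `PER_TURN` variant, the `Message` shape, the pairing rule) is a hypothetical illustration, not the ADR's implementation:

```python
from dataclasses import dataclass
from enum import Enum

class ConversationSplit(Enum):
    FULL = "full"          # one item covering the entire trajectory
    PER_TURN = "per_turn"  # one item per user/assistant exchange (assumed variant)

@dataclass
class Message:
    role: str  # "user" or "assistant"
    text: str

def split_conversation(
    conversation: list[Message], split: ConversationSplit
) -> list[tuple[str, str]]:
    """Derive (query, response) pairs; evaluators only ever see these pairs."""
    if split is ConversationSplit.FULL:
        query = "\n".join(m.text for m in conversation if m.role == "user")
        response = "\n".join(m.text for m in conversation if m.role == "assistant")
        return [(query, response)]
    # PER_TURN: pair each user message with the assistant reply that follows it
    pairs = []
    for prev, nxt in zip(conversation, conversation[1:]):
        if prev.role == "user" and nxt.role == "assistant":
            pairs.append((prev.text, nxt.text))
    return pairs
```

Because the splitting happens here, an individual evaluator never needs to know which strategy produced its input.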
```python
@function_evaluator
def mentions_city(response: str, expected: str) -> bool:
```
What is the expected signature for function evaluators? I see two different ones here.

Added more info about the signature below:

@evaluator uses parameter name injection: the function's parameter names determine what data it receives from the EvalItem. Supported names: query, response, expected, conversation, tools, context. Any combination is valid.
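Parameter name injection can be sketched in a few lines with `inspect.signature`. The `EVAL_ITEM` dict and `run_function_evaluator` helper below are hypothetical stand-ins for the real EvalItem and decorator machinery, shown only to illustrate why any combination of the supported names works:

```python
import inspect

# Hypothetical EvalItem fields, mirroring the supported parameter names.
EVAL_ITEM = {
    "query": "What's the weather?",
    "response": "It is sunny in Paris.",
    "expected": "sunny",
    "conversation": [],
    "tools": [],
    "context": None,
}

def run_function_evaluator(fn, item: dict):
    """Inject only the EvalItem fields named in fn's signature."""
    names = inspect.signature(fn).parameters
    unknown = set(names) - set(item)
    if unknown:
        raise TypeError(f"unsupported evaluator parameters: {unknown}")
    return fn(**{name: item[name] for name in names})

# Two evaluators with different signatures; both are valid because each
# parameter name maps to an EvalItem field.
def mentions_expected(response: str, expected: str) -> bool:
    return expected.lower() in response.lower()

def has_query(query: str) -> bool:
    return bool(query)
```

So the two signatures seen in the ADR are not inconsistent; they are two points in the same space of allowed parameter combinations.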
```python
# Tier 3: Full context — inspect conversation and tools
@function_evaluator
def used_tools(conversation: list, tools: list) -> float:
```
Could we demonstrate how GAIA would fit into this? It has a lot of knobs, which is probably useful, so it would be good to understand whether the proposed abstraction can accommodate all of those settings.
```python
# results[0] (FoundryEvals)
EvalResults(status="completed", passed=1, failed=0, total=1)
items[0]: EvalItemResult(query="What's the weather?", scores={"relevance": 5, "coherence": 5})
```
Does an item include the agent response and tool usage for auditing purposes?

Currently it does not. Do you think it needs to?

I feel like there should be, just for completeness.
```python
results = await evaluate_agent(
    agent=my_agent,
```
Does this invoke the agent to get responses for evaluation? If so, how many times will it invoke the agent?

Yes, it does if you provide queries. It will invoke the agent once per query.

A single invocation sounds less convincing. If the agent is invoked only once, the results can hardly be a performance indicator for it. Would it make more sense for the agent to be invoked, say, 10 times, with the results indicating that it performed well in 9 of them (just a random number)?
3. Manually wire up the correct Foundry data source type (`azure_ai_traces`, `jsonl`, `azure_ai_target_completions`, etc.) depending on their scenario
4. Handle App Insights trace ID queries, response ID collection, and eval polling

Additionally, evaluation is a concern that extends beyond any single provider. Developers may want to use local evaluators (LLM-as-judge, regex, keyword matching), third-party evaluation libraries, or multiple providers in combination. The architecture must support this without creating Foundry-specific lock-in at the API level.
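As a concrete illustration of mixing evaluator kinds, here is a self-contained sketch of regex and keyword checks sharing one `(query, response) -> bool` protocol, which is what lets heterogeneous providers coexist behind one API. The names (`regex_check`, `keyword_check`, `run_all`) are hypothetical and not part of the proposed API:

```python
import re
from typing import Callable

Check = Callable[[str, str], bool]  # shared (query, response) -> bool protocol

def regex_check(pattern: str) -> Check:
    """Local evaluator: pass if the response matches a regex."""
    compiled = re.compile(pattern)
    def check(query: str, response: str) -> bool:
        return bool(compiled.search(response))
    return check

def keyword_check(*keywords: str) -> Check:
    """Local evaluator: pass if every keyword appears in the response."""
    def check(query: str, response: str) -> bool:
        return all(k.lower() in response.lower() for k in keywords)
    return check

def run_all(evaluators: dict[str, Check], query: str, response: str) -> dict[str, bool]:
    """Run a mixed bag of evaluators and collect named scores."""
    return {name: fn(query, response) for name, fn in evaluators.items()}
```

A provider-backed evaluator (Foundry, a third-party library) would just be another entry in the dict, as long as it honors the same protocol.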
Is there a list of third-party libraries to support?
`EvalItem` is a lightweight record used only by `FunctionEvaluator` and `LocalEvaluator` to pass context to check functions. It is not part of the `IEvaluator` interface:
Are multi-modal responses to be supported for evaluation?
Captures the design for integrating Azure AI Foundry Evaluations with agent-framework. Key decisions:
- EvalItem with conversation (list[Message]) as the single source of truth
- query/response derived from configurable conversation split strategies
- Tools as list[FunctionTool] (including auto-extracted MCP tools)
- FoundryEvals provider with auto-detection of evaluator capabilities
- LocalEvaluator with @function_evaluator decorator for local checks
- Consistent Python/C# APIs: evaluate_agent, evaluate_workflow

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>