Python: Foundry Evals integration for Python #4750

Draft
alliscode wants to merge 1 commit into microsoft:main from alliscode:af-foundry-evals-python

@alliscode
Member

Add evaluation framework with local and Foundry-hosted evaluator support:

  • EvalItem/EvalResult core types with conversation splitting strategies
  • @evaluator decorator for defining custom evaluation functions
  • LocalEvaluator for running evaluations locally
  • FoundryEvals provider for Azure AI Foundry hosted evaluations
  • evaluate_agent() orchestration with expected values support
  • evaluate_workflow() for multi-agent workflow evaluation
  • Comprehensive test suite and evaluation samples

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@markwallace-microsoft added the documentation and python labels on Mar 17, 2026
@github-actions bot changed the title from "Foundry Evals integration for Python" to "Python: Foundry Evals integration for Python" on Mar 17, 2026
@alliscode force-pushed the af-foundry-evals-python branch from a0edd5f to fe9e621 on March 17, 2026 21:21

Member

let's call this file _evaluation and include the contents of _local_eval

Member Author

Done ✅ — merged _eval.py + _local_eval.py into single _evaluation.py. All imports updated across 12 files.

assistant_texts = [m.text for m in response_msgs if m.role == "assistant" and m.text]
return " ".join(assistant_texts).strip()

def to_dict(
Member

This should be named something else, because it is not just a dict; it is a highly specific dict.

Member Author

Renamed to to_eval_data() to better reflect the specific structure it produces.

"""


@dataclass
Member

We are putting an awful lot of logic into a dataclass. That is not the intent of dataclasses (at least not how we prefer to use them), so let's either turn it into a regular class, or move the helper functions outside of it and ensure they accept an EvalItem object as input.

Member Author

Converted EvalItem from dataclass to regular class with __init__. Helper methods stay on the class since they operate on self.
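
For readers following the thread, a minimal sketch of what "regular class with __init__" could look like. The Message stand-in, the constructor parameters, and the keys returned by to_eval_data() are assumptions made for illustration; only the class name, the method name, and the assistant-text joining logic come from this conversation.

```python
from typing import Any, Optional


class Message:
    """Hypothetical stand-in for the framework's message type."""

    def __init__(self, role: str, text: Optional[str]) -> None:
        self.role = role
        self.text = text


class EvalItem:
    """Plain class (not a dataclass) holding one conversation to evaluate."""

    def __init__(self, query: str, response_msgs: list[Message]) -> None:
        self.query = query
        self.response_msgs = response_msgs

    def response_text(self) -> str:
        # Same joining logic as the snippet quoted in the review:
        # keep only assistant messages that carry non-empty text.
        assistant_texts = [
            m.text for m in self.response_msgs if m.role == "assistant" and m.text
        ]
        return " ".join(assistant_texts).strip()

    def to_eval_data(self) -> dict[str, Any]:
        # Key names are assumed; the PR's actual schema is not shown here.
        return {"query": self.query, "response": self.response_text()}
```

Keeping the helpers as methods matches the author's reply above; the reviewer's alternative would be module-level functions that take an EvalItem argument.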

result = func(*args, **kwargs)
if inspect.isawaitable(result):
    return await result
return await asyncio.to_thread(lambda: result)
Member

why is this needed?

Member Author

Removed — the isawaitable check is sufficient, no need for asyncio.to_thread.
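
A small sketch of the simplified pattern, to illustrate why the thread hop was redundant: by the time asyncio.to_thread(lambda: result) runs, a synchronous func has already executed on the event loop, so the extra call only passes a finished value through a worker thread. The names run_check, sync_check, and async_check are hypothetical and not taken from the PR.

```python
import asyncio
import inspect
from typing import Any, Callable


async def run_check(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call a sync or async check and return its final value."""
    result = func(*args, **kwargs)
    if inspect.isawaitable(result):
        # Async checks hand back a coroutine (or other awaitable): await it.
        return await result
    # Sync checks have already finished here, so there is nothing to offload.
    return result


def sync_check(text: str) -> bool:
    return "hello" in text


async def async_check(text: str) -> bool:
    await asyncio.sleep(0)  # stands in for real async work
    return text.endswith("!")
```

If a synchronous check were slow enough to block the loop, the call itself would need to go through asyncio.to_thread(func, *args, **kwargs), which is a different change from the one discussed here.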



async def _poll_eval_run(
    client: OpenAI | AsyncOpenAI,
Member

I think we should limit this to AsyncOpenAI. We use async everywhere in AF, so it doesn't make much sense to suddenly introduce sync here.

Member Author

Done — limited to AsyncOpenAI only. Removed sync OpenAI support since AF is async-everywhere.
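
As an illustration of what an async-only polling helper might reduce to, here is a sketch that does not touch the OpenAI SDK at all. The fetch_status callable is a hypothetical stand-in for whatever AsyncOpenAI call _poll_eval_run makes, and the status names, interval, and timeout defaults are assumptions.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed terminal status names; the real service may use different ones.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],
    *,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> str:
    """Await fetch_status() repeatedly until it reports a terminal status."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Give up if the next sleep would carry us past the deadline.
        if loop.time() + interval > deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        await asyncio.sleep(interval)
```

With a single async code path there is no need to branch on the client type before each request, which is the simplification the reviewer asked for.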

self,
*,
project_client: Any | None = None,
openai_client: OpenAI | AsyncOpenAI | None = None,
Member

same here

Member Author

Done — same async-only change applied here.

def __init__(
self,
*,
project_client: Any | None = None,
Member

project client is a dependency of the core framework, so we can type this

Member Author

Done — typed as AIProjectClient from azure.ai.projects.aio under TYPE_CHECKING.
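
A sketch of the TYPE_CHECKING import pattern described in the reply. The class name FoundryEvalsSketch and the value assigned to name are hypothetical; the import path azure.ai.projects.aio and the AIProjectClient type are the ones named above. Because the import sits under TYPE_CHECKING and the annotation is a string, this module loads even when the Azure package is not installed.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; not executed at runtime.
    from azure.ai.projects.aio import AIProjectClient


class FoundryEvalsSketch:
    """Hypothetical, trimmed-down stand-in for the PR's FoundryEvals class."""

    def __init__(self, *, project_client: "AIProjectClient | None" = None) -> None:
        self.name = "foundry"  # assumed value, for illustration only
        self.project_client = project_client
```

The trade-off is that the type is available for static analysis only; any runtime isinstance check would need a real import.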

NotImplementedError: The continuous evaluation rules API shape is not
yet finalized.
"""
raise NotImplementedError(
Member

if this is not ready, let's remove it for now

Member Author

Removed the entire setup_continuous_evaluation function.

@alliscode force-pushed the af-foundry-evals-python branch 5 times, most recently from 8d4289c to 15d8640 on March 19, 2026 16:57
Merged and refactored eval module per Eduard's PR review:

- Merge _eval.py + _local_eval.py into single _evaluation.py
- Convert EvalItem from dataclass to regular class
- Rename to_dict() to to_eval_data()
- Convert _AgentEvalData to TypedDict
- Simplify check system: unified async pattern with isawaitable
- Parallelize checks and evaluators with asyncio.gather
- Add all/any mode to tool_called_check
- Fix bool(passed) truthy bug in _coerce_result
- Remove deprecated function_evaluator/async_function_evaluator aliases
- Remove _MinimalAgent, tighten evaluate_agent signature
- Set self.name in __init__ (LocalEvaluator, FoundryEvals)
- Limit FoundryEvals to AsyncOpenAI only
- Type project_client as AIProjectClient
- Remove NotImplementedError continuous eval code
- Add evaluation samples in 02-agents/ and 03-workflows/
- Update all imports and tests (167 passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
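
The commit's "parallelize checks and evaluators with asyncio.gather" item can be pictured with the sketch below. It is not the PR's code: run_checks, _call, and the two sample checks are hypothetical names, and each check is assumed to take the response text and return a value directly or via a coroutine.

```python
import asyncio
import inspect
from typing import Any, Callable, Sequence


async def _call(check: Callable[[str], Any], response: str) -> Any:
    # Unified pattern: call the check, await only if it returned an awaitable.
    result = check(response)
    if inspect.isawaitable(result):
        result = await result
    return result


async def run_checks(checks: Sequence[Callable[[str], Any]], response: str) -> list[Any]:
    """Run every check concurrently; results keep the order of `checks`."""
    return list(await asyncio.gather(*(_call(c, response) for c in checks)))


def has_greeting(response: str) -> bool:
    return "hello" in response.lower()


async def is_short(response: str) -> bool:
    await asyncio.sleep(0)  # stands in for real async work
    return len(response) < 10
```

asyncio.gather returns results in argument order regardless of completion order, so callers can pair each result with its check by position.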
@alliscode force-pushed the af-foundry-evals-python branch from 15d8640 to aad92ac on March 19, 2026 20:41

@markwallace-microsoft
Member

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py | 229 | 49 | 78% | 247, 269, 274–275, 292–296, 303, 306–309, 318–327, 586, 593, 605, 612, 727–728, 730–731, 738, 744–745, 747, 751–754, 756, 763, 770, 813–814, 816, 826, 835, 842 |
| packages/core/agent_framework/_agents.py | 362 | 47 | 87% | 465, 469, 524, 942, 978, 994, 1091–1095, 1150, 1178, 1311, 1327, 1329, 1342, 1348, 1384, 1386, 1395–1400, 1405, 1407, 1413–1414, 1421, 1423–1424, 1432–1433, 1436–1438, 1448–1453, 1457, 1462, 1464 |
| packages/core/agent_framework/_evaluation.py | 613 | 96 | 84% | 225, 257, 272, 486, 488, 592–593, 672–674, 679, 719–722, 779–780, 783, 789–791, 793, 824–826, 878, 903–918, 920, 922, 1018, 1124, 1424–1425, 1431–1432, 1459, 1461–1464, 1470, 1474–1476, 1480–1482, 1486–1487, 1507–1510, 1512, 1585, 1600, 1604–1606, 1631, 1637–1641, 1675, 1696–1699, 1701, 1703–1705, 1715, 1721–1722, 1724, 1757–1758, 1763 |
| packages/core/agent_framework/_workflows/_agent_executor.py | 208 | 24 | 88% | 97, 113, 168–169, 221–222, 224–225, 255–257, 265–267, 275–277, 279, 283, 287, 394–395, 460, 479 |
| packages/core/agent_framework/_workflows/_workflow.py | 270 | 19 | 92% | 88, 269–271, 273–274, 292, 296, 434, 622, 643, 699, 711, 717, 722, 742–744, 757 |
| TOTAL | 27975 | 3366 | 87% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 5443 | 20 💤 | 0 ❌ | 0 🔥 | 1m 27s ⏱️ |

Labels: documentation, python

3 participants