IndirectAttackEvaluator not uploading/displaying results in AI Foundry correctly #45639

@jgbradley1

Description

  • Package Name: azure.ai.evaluation
  • Package Version: 1.15.3
  • Operating System: MacOS
  • Python Version: 3.12

Describe the bug
There appears to be a problem with IndirectAttackEvaluator. After query/response pairs have been simulated and the evaluation results uploaded to AI Foundry, no results appear in the Foundry portal, even though the results returned programmatically show that the evaluation ran correctly.

It is not clear whether the problem is in the SDK or in Foundry. This is a blocker for all RAI evaluations that rely on indirect jailbreaking via the IndirectAttackEvaluator class.

To Reproduce

import os
from typing import Any, Dict, List, Optional

from azure.ai.evaluation import IndirectAttackEvaluator, evaluate
from azure.ai.evaluation.simulator import IndirectAttackSimulator
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI


azure_ai_project_endpoint = "<ai-foundry-project-endpoint>"
azure_endpoint = "<azure_endpoint>"
deployment = "gpt-5.1"
api_version = "2025-03-01-preview"

# sample application
def call_llm(query: str) -> str:
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    client = AzureOpenAI(
        api_version=api_version,
        azure_endpoint=azure_endpoint,
        azure_ad_token_provider=token_provider,
    )
    result = client.responses.create(
        model=deployment,
        input=query,
    )
    return result.output_text

async def callback(
    messages: Dict,  # a dict with a "messages" key, not a bare list
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    query = messages_list[-1]["content"]
    context = None  # this sample app does not use any incoming context

    # Send message to application and get a response
    try:
        response = call_llm(query)
    except Exception:
        response = None

    # Format response in OpenAI message protocol
    message = {"content": response, "role": "assistant", "context": context}
    messages_list.append(message)
    return {"messages": messages_list, "stream": stream, "session_state": session_state, "context": context}

# set up and run simulator
indirect_simulator = IndirectAttackSimulator(
    azure_ai_project=azure_ai_project_endpoint,
    credential=DefaultAzureCredential(),
)

sim_results = await indirect_simulator(  # top-level await: run in a notebook or async entry point
    target=callback,
    max_conversation_turns=3,
    max_simulation_results=5,
)

# save simulated results to file
with open("indirect_jailbreak_example.jsonl", "w") as file:
    file.write(sim_results.to_eval_qr_json_lines())

# set up evaluator and evaluate the simulated jailbreak conversations
indirect_evaluator = IndirectAttackEvaluator(
    azure_ai_project=azure_ai_project_endpoint,
    credential=DefaultAzureCredential(),
)

eval_results = evaluate(
    evaluation_name="example-indirect-jailbreak-evaluation",
    data="indirect_jailbreak_example.jsonl",
    evaluators={"indirect_attack": indirect_evaluator},
    azure_ai_project=azure_ai_project_endpoint,
)
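For reference, this is roughly how the programmatic results can be inspected to confirm the evaluator ran even while the portal shows nothing. The dict shape below is a hypothetical sketch (evaluate() returns a plain dict with "metrics", "rows", and "studio_url" keys); the specific metric key name "indirect_attack.xpia_defect_rate" is an assumption for illustration, so substitute whatever keys appear in your own eval_results:

```python
# Hypothetical shape of the dict returned by evaluate(); metric/row key names
# here are illustrative assumptions, not guaranteed SDK output.
sample_eval_results = {
    "metrics": {"indirect_attack.xpia_defect_rate": 0.2},
    "rows": [
        {"inputs.query": "q1", "outputs.indirect_attack.xpia_label": False},
        {"inputs.query": "q2", "outputs.indirect_attack.xpia_label": True},
    ],
    "studio_url": "https://ai.azure.com/...",  # link that should open the run in Foundry
}

def summarize(results: dict) -> str:
    """One-line summary proving the evaluator produced scores locally."""
    n_rows = len(results.get("rows", []))
    metrics = results.get("metrics", {})
    return f"{n_rows} rows evaluated, metrics: {metrics}"

print(summarize(sample_eval_results))
```

In my runs, this kind of local summary shows non-empty metrics and rows, which is what makes the empty Foundry view look like an upload/display bug rather than an evaluation failure.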

Expected behavior
I expect the eval results/scores to be reported and summarized correctly in Foundry. Currently no scores are recorded, even though the eval_results object returned by evaluate() shows clearly that the evaluator ran correctly.

After further testing, this class worked through the v1.14.0 release; the problem first appeared in the v1.15.0 release.
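Until the regression is fixed, pinning the package to the last known-good release noted above is a possible workaround (assumption: 1.14.0 behaves as described in this report). The sketch below shows the pin command and a runtime check of the installed version using the standard library:

```python
# Hedged workaround sketch. Pin the SDK to the last release where uploads
# reportedly worked:
#
#     pip install "azure-ai-evaluation==1.14.0"
#
# Runtime check of which version is actually installed:
from importlib.metadata import PackageNotFoundError, version

def eval_sdk_is_pinned(expected: str = "1.14.0") -> bool:
    """True only when azure-ai-evaluation is installed at the expected version."""
    try:
        return version("azure-ai-evaluation") == expected
    except PackageNotFoundError:
        # Package not installed in this environment.
        return False
```

A check like this can be useful in CI to keep the pin from silently drifting while the issue is open.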

Screenshots

(screenshot attached)


Labels

  • Evaluation: Issues related to the client library for Azure AI Evaluation
  • Service Attention: Workflow: This issue is responsible by Azure service team.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that
