[azure_ai_evaluator] Mitigate Gender Bias for Task Completion and Intent Resolution Evaluations

**Is your feature request related to a problem? Please describe.**
The `azure_ai_evaluator` used by Azure AI Foundry's evaluation framework appears to exhibit gender bias when acting as an LLM judge for the Task Completion and Intent Resolution metrics.

In our evaluation dataset, prompts are intentionally gender-neutral (for example, "Who scored the most goals last year?"). The prompt does not specify men's or women's competitions. We evaluate different valid answer sets separately:

- One evaluation set contains men's competition ground truth and corresponding responses.
- Another evaluation set contains women's competition ground truth and corresponding responses.
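For reference, a minimal sketch of how the two evaluation sets are shaped. The field names (`variant`, `query`, `ground_truth`, `response`) and the helper functions are assumptions for illustration, and the answers are placeholders rather than real data:

```python
import json

# The prompt is shared by both sets and names no competition.
PROMPT = "Who scored the most goals last year?"


def build_eval_set(variant: str, ground_truth: str, response: str) -> list[dict]:
    """Build a one-row evaluation set (hypothetical schema)."""
    return [
        {
            "variant": variant,            # which competition the answers refer to
            "query": PROMPT,               # identical across variants
            "ground_truth": ground_truth,  # expected answer for this variant
            "response": response,          # system response under evaluation
        }
    ]


def to_jsonl(rows: list[dict]) -> str:
    """Serialise rows as JSON Lines, one record per line."""
    return "\n".join(json.dumps(row) for row in rows)


# Placeholder answers: each response matches its own ground truth.
mens_set = build_eval_set("mens", "<men's top scorer>", "<men's top scorer>")
womens_set = build_eval_set("womens", "<women's top scorer>", "<women's top scorer>")
```

Both sets carry the same `query`; only `ground_truth` and `response` differ, so any difference in scores comes from how the judge treats the answers.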

Despite the responses being correctly aligned with their respective ground truth, the evaluator consistently marks women's answers as incorrect for Task Completion and Intent Resolution. The accompanying rationale often suggests that the women's answers are less relevant or incorrect compared with men's answers, even when they fully satisfy the evaluation criteria.

This behaviour introduces systematic bias into evaluation results and may unfairly penalise systems that provide correct answers about women's sport, female professionals, or other gender-specific domains.
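One way to quantify the disparity is to compare pass rates between the two sets. The sketch below assumes each evaluated row has been reduced to a `variant` label and a boolean `passed` flag; both names and the helper functions are hypothetical, not part of the evaluator's output schema:

```python
from collections import defaultdict


def pass_rate_by_variant(results: list[dict]) -> dict[str, float]:
    """Return the fraction of passing rows for each variant."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["variant"]] += 1
        if row["passed"]:
            passes[row["variant"]] += 1
    return {variant: passes[variant] / totals[variant] for variant in totals}


def pass_rate_gap(results: list[dict], a: str, b: str) -> float:
    """Pass rate of variant `a` minus pass rate of variant `b`."""
    rates = pass_rate_by_variant(results)
    return rates[a] - rates[b]
```

Because every response in both sets matches its ground truth, an unbiased judge should produce a gap close to zero.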

**Describe the solution you'd like**
Implement safeguards within `azure_ai_evaluator` to reduce evaluator bias and ensure consistent assessment across demographic groups.

Potential improvements include:

1. Ground-Truth-First Evaluation: Where a response matches the supplied ground truth, the evaluator should prioritise semantic alignment with that ground truth rather than relying on external knowledge or assumptions about the "default" interpretation of the prompt.

2. Demographic Neutrality Safeguards: Introduce evaluation guardrails that detect when demographic attributes (such as gender, ethnicity, or nationality) are influencing a judgement despite not being relevant to the evaluation criteria.
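Both ideas could be prototyped at the harness layer, outside the evaluator itself. The following is a rough sketch under stated assumptions: `Judge` is a hypothetical callable standing in for whatever returns the pass/fail verdict, normalised substring matching is a crude stand-in for real semantic alignment, and the term-swap list is illustrative only. It does not use or describe the `azure_ai_evaluator` API.

```python
import re
from typing import Callable

# Hypothetical judge signature: (query, response, ground_truth) -> passed
Judge = Callable[[str, str, str], bool]

_SWAP = {"men": "women", "women": "men", "man": "woman",
         "woman": "man", "male": "female", "female": "male"}
_PATTERN = re.compile(r"\b(" + "|".join(_SWAP) + r")\b", re.IGNORECASE)


def normalise(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def ground_truth_first(judge: Judge) -> Judge:
    """Wrap a judge so a response containing the ground truth passes outright."""
    def wrapped(query: str, response: str, ground_truth: str) -> bool:
        if ground_truth and normalise(ground_truth) in normalise(response):
            return True  # the supplied ground truth wins over the judge's priors
        return judge(query, response, ground_truth)
    return wrapped


def swap_gender_terms(text: str) -> str:
    """Replace each listed gender term with its counterpart (lowercased)."""
    return _PATTERN.sub(lambda m: _SWAP[m.group(0).lower()], text)


def verdict_depends_on_gender(judge: Judge, query: str,
                              response: str, ground_truth: str) -> bool:
    """Flag cases where swapping gender terms alone flips the verdict."""
    original = judge(query, response, ground_truth)
    swapped = judge(query, swap_gender_terms(response),
                    swap_gender_terms(ground_truth))
    return original != swapped
```

The counterfactual check only catches cases where gender is named explicitly in the text. It would not detect bias triggered by, for example, an athlete's name, so it is a partial guardrail at best.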

**Describe alternatives you've considered**
We considered:

- Explicitly specifying "men's" or "women's" in every evaluation prompt.
- Creating separate evaluation pipelines for men's and women's datasets.
- Rewriting prompts to reduce ambiguity.

However, these approaches do not address the underlying issue. The evaluator should be capable of assessing responses against the supplied ground truth without introducing demographic assumptions.

We also considered replacing Task Completion and Intent Resolution evaluations with custom rule-based evaluators. While this may reduce the impact of the issue, it diminishes the value of the built-in Azure AI Foundry evaluation framework and introduces additional implementation and maintenance overhead.

**Additional context**
Screenshot 1: Benchmark / Evaluation Dataset Configuration
Shows the benchmark setup, including:
- The gender-neutral prompts used in the evaluation set.
- The men's and women's variants of the benchmark.
- Ground truth answers configured for each dataset.

<img width="2856" height="600" alt="Image" src="https://github.com/user-attachments/assets/55d3b9d8-8766-4245-8664-213510350bb7" />

<img width="2871" height="602" alt="Image" src="https://github.com/user-attachments/assets/b33222dc-0fe0-489e-95dc-25f4d5eb7f6c" />

Screenshot 2: Men's Dataset Evaluation
Shows a representative example from the men's benchmark:
- Prompt.
- Ground truth.
- Model response.
- Evaluator rationale.
- Successful Task Completion and Intent Resolution scores.

This serves as the baseline behaviour.

<img width="3199" height="950" alt="Image" src="https://github.com/user-attachments/assets/b10e6bee-1de4-4598-ba61-3e374527e81e" />

Screenshot 3: Women's Dataset Evaluation
Shows the equivalent example from the women's benchmark:
- Identical prompt structure.
- Women's ground truth.
- Response aligned with that ground truth.
- Failed Task Completion and/or Intent Resolution scores.

This demonstrates the inconsistency in evaluation behaviour.

<img width="3193" height="939" alt="Image" src="https://github.com/user-attachments/assets/7295747b-e43d-4fa7-94bb-98ad90befef1" />

Examples of problematic explanations:

<img width="1106" height="406" alt="Image" src="https://github.com/user-attachments/assets/b6407521-09d1-4601-9655-c789f486e19e" />

Finally:

<img width="391" height="220" alt="Image" src="https://github.com/user-attachments/assets/5f01e3cf-c56e-466c-9357-50a0ec731cac" />

I am raising this as a feature request rather than a bug because the behaviour is arguably expected given the data these models were originally trained on. That said, hopefully it can be mitigated at the prompt/harness layer.


[azure_ai_evaluator] Mitigate Gender Bias for Task Completion and Intent Resolution Evaluations #47524
