Is your feature request related to a problem? Please describe.
The `azure_ai_evaluator` used by Azure AI Foundry's evaluation framework appears to exhibit bias when acting as an LLM judge for Task Completion and Intent Resolution metrics.
In our evaluation dataset, prompts are intentionally gender-neutral (for example, "Who scored the most goals last year?"). The prompt does not specify men's or women's competitions. We evaluate different valid answer sets separately:
- One evaluation set contains men's competition ground truth and corresponding responses.
- Another evaluation set contains women's competition ground truth and corresponding responses.
Despite the responses being correctly aligned with their respective ground truth, the evaluator consistently marks women's answers as incorrect for Task Completion and Intent Resolution. The accompanying rationale often suggests that the women's answers are less relevant or incorrect compared with men's answers, even when they fully satisfy the evaluation criteria.
This behaviour introduces systematic bias into evaluation results and may unfairly penalise systems that provide correct answers about women's sport, female professionals, or other gender-specific domains.
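To make the gap measurable, the judge's pass rate can be compared per dataset variant, counting only rows where the response already matches the supplied ground truth. The snippet below is a minimal sketch: `JudgedRow`, its field names, and `pass_rate_by_variant` are hypothetical and not part of any Azure SDK, and the rows are assumed to have been exported from an evaluation run by some other means.

```python
from dataclasses import dataclass


@dataclass
class JudgedRow:
    """One evaluated row (hypothetical shape, not an Azure SDK type)."""
    variant: str                 # e.g. "men" or "women": which ground-truth set the row belongs to
    matches_ground_truth: bool   # whether the response agrees with the supplied ground truth
    judge_passed: bool           # pass/fail verdict returned by the LLM judge


def pass_rate_by_variant(rows):
    """Return the judge's pass rate per variant, using only rows that match ground truth."""
    totals = {}
    passes = {}
    for row in rows:
        if not row.matches_ground_truth:
            # Responses that really are wrong say nothing about judge bias.
            continue
        totals[row.variant] = totals.get(row.variant, 0) + 1
        passes[row.variant] = passes.get(row.variant, 0) + int(row.judge_passed)
    return {variant: passes[variant] / totals[variant] for variant in totals}
```

If the judge were unbiased, the rates for the two variants should be close to each other, since every counted row is correct with respect to its own ground truth.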
Describe the solution you'd like
Implement safeguards within `azure_ai_evaluator` to reduce evaluator bias and ensure consistent assessment across demographic groups.
Potential improvements include:
- Ground-Truth-First Evaluation: Where a response matches the supplied ground truth, the evaluator should prioritise semantic alignment with that ground truth rather than relying on external knowledge or assumptions about the "default" interpretation of the prompt.
- Demographic Neutrality Safeguards: Introduce evaluation guardrails that detect when demographic attributes (such as gender, ethnicity, or nationality) are influencing a judgement despite not being relevant to the evaluation criteria.
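As a rough illustration of how the two improvements above might look at the harness layer, here is a sketch that wraps an arbitrary judge. Everything in it is an assumption for illustration: `judge` is a hypothetical callable returning `(passed, rationale)`, the token-overlap check is only a crude stand-in for real semantic alignment, and `DEMOGRAPHIC_TERMS` is a deliberately tiny placeholder list. None of these names come from `azure_ai_evaluator`.

```python
import re

# Hypothetical, deliberately small term list; a real safeguard would need a
# broader, reviewed vocabulary covering other demographic attributes too.
DEMOGRAPHIC_TERMS = {"men", "mens", "male", "women", "womens", "female"}


def tokens(text):
    """Lower-case the text, drop apostrophes, and split into alphanumeric tokens."""
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z0-9]+", cleaned)


def matches_ground_truth(response, ground_truth):
    """Crude stand-in for semantic alignment: every ground-truth token occurs in the response."""
    response_tokens = set(tokens(response))
    return all(tok in response_tokens for tok in tokens(ground_truth))


def unprompted_terms(query, rationale):
    """Demographic terms that appear in the judge's rationale but not in the query."""
    return (set(tokens(rationale)) & DEMOGRAPHIC_TERMS) - set(tokens(query))


def guarded_verdict(judge, query, response, ground_truth):
    """Apply a ground-truth-first override and flag demographic terms in the rationale."""
    passed, rationale = judge(query, response)
    # Ground-truth-first: a failing verdict is overridden when the response
    # already agrees with the supplied ground truth.
    overridden = (not passed) and matches_ground_truth(response, ground_truth)
    return {
        "passed": passed or overridden,
        "overridden": overridden,
        "flagged_terms": sorted(unprompted_terms(query, rationale)),
    }
```

A non-empty `flagged_terms` on a gender-neutral prompt would suggest the judge introduced a demographic assumption of its own, and such rows could be routed to manual review rather than silently overridden.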
Describe alternatives you've considered
We considered:
- Explicitly specifying "men's" or "women's" in every evaluation prompt.
- Creating separate evaluation pipelines for men's and women's datasets.
- Rewriting prompts to reduce ambiguity.
However, these approaches do not address the underlying issue. The evaluator should be capable of assessing responses against the supplied ground truth without introducing demographic assumptions.
We also considered replacing Task Completion and Intent Resolution evaluations with custom rule-based evaluators. While this may reduce the impact of the issue, it diminishes the value of the built-in Azure AI Foundry evaluation framework and introduces additional implementation and maintenance overhead.
Additional context
Screenshot 1: Benchmark / Evaluation Dataset Configuration
Shows the benchmark setup, including:
- The gender-neutral prompts used in the evaluation set.
- The men's and women's variants of the benchmark.
- Ground truth answers configured for each dataset.
Screenshot 2: Men's Dataset Evaluation
Shows a representative example from the men's benchmark:
- Prompt.
- Ground truth.
- Model response.
- Evaluator rationale.
- Successful Task Completion and Intent Resolution scores.
This serves as the baseline behaviour.
Screenshot 3: Women's Dataset Evaluation
Shows the equivalent example from the women's benchmark:
- Identical prompt structure.
- Women's ground truth.
- Response aligned with that ground truth.
- Failed Task Completion and/or Intent Resolution scores.
This demonstrates the inconsistency in evaluation behaviour.
Examples of problematic explanations:
Finally, we are raising this as a Feature Request rather than a Bug, since this is expected behaviour given the data these models were initially trained on. That said, we hope it can be mitigated at the prompt/harness layer.