Fall back to rubric scores when pairwise judge is position-swap-inconsistent#475
Fall back to rubric scores when pairwise judge is position-swap-inconsistent#475
Conversation
…istent When the pairwise judge disagrees with itself across position swaps, MergeInconsistentResults defaults everything to 'tie', which PairwiseToQualityScore converts to 0. This zeroes out 70% of the weighted score (Quality 40% + Overall 30%), leaving only the efficiency penalties to determine the result. For action-oriented skills that do real work (download symbols, run tools), this means quality improvements measured by rubric scoring get discarded, and the overhead of doing valuable work pushes the score negative. Fix: only override rubric-based scores when pairwise is consistent. When inconsistent, keep the rubric-based quality signal — it's noisier but non-zero. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adjusts skill-validator scoring so position-swap-inconsistent pairwise judgments no longer zero out the quality signal, preserving rubric/independent-judge improvements when pairwise is unreliable.
Changes:
Comparator.CompareScenarionow applies pairwise overrides only whenPositionSwapConsistent == true.- Adds a regression test ensuring rubric-based scores are retained when pairwise is inconsistent.
- Updates investigation guidance to reflect the new fallback behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| eng/skill-validator/src/Evaluate/Comparator.cs | Guards pairwise override behind PositionSwapConsistent to avoid tie→0 overriding rubric/overall improvements. |
| eng/skill-validator/tests/Evaluate/ComparatorTests.cs | Adds test coverage for the inconsistent-pairwise fallback behavior. |
| eng/skill-validator/InvestigatingResults.md | Documents how position-swap inconsistency affects which quality signals are used. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (pairwiseResult is not null) | ||
| // Override quality scores with pairwise results when consistent. | ||
| // When position-swap-inconsistent, keep the rubric-based scores — | ||
| // zeroing out 70% of the score on a noisy comparison is worse than |
There was a problem hiding this comment.
The comment hard-codes “70% of the score”, which is derived from the current DefaultWeights (0.40 + 0.30). To avoid this drifting if weights change, consider rephrasing (e.g., “most of the weighted score”) or referencing DefaultWeights rather than embedding the percentage in prose.
| // zeroing out 70% of the score on a noisy comparison is worse than | |
| // zeroing out most of the weighted score on a noisy comparison is worse than |
|
Given this change, are we now entirely discarding the result from the pairwise judge when the position-swap is inconsistent? Isn't this a concern? AI tell me ;) |
Yes, because my read is that the judge contradicted itself. Also, calling it a "tie" and scoring 0 impacts 70% of the weighted score. That's... not great. |
|
@adityamandaleeka this is coming from your original implementation. Do you have a recommendation here? |
Problem
When the pairwise judge is position-swap-inconsistent (disagrees with itself when A/B positions are swapped),
MergeInconsistentResultsdefaults everything totie, whichPairwiseToQualityScoreconverts to 0. Since pairwise results override bothQualityImprovement(40%) andOverallJudgmentImprovement(30%), this zeros out 70% of the weighted score.For action-oriented skills that do real work (download symbols, run tools), rubric scoring may show genuine quality improvement (e.g., 2.3/5 → 3.0/5), but the pairwise comparison is noisy enough to flip on position swap. The result: quality credit gets discarded and the overhead penalties from doing valuable work push the score negative.
Concrete example
From the apple-crash-symbolication skill eval on PR #201:
Fix
Only override rubric-based scores when the pairwise comparison is position-swap-consistent. When inconsistent, keep the rubric-based quality signal — it's noisier than a consistent pairwise result, but non-zero.
One-line guard:
if (pairwiseResult is not null && pairwiseResult.PositionSwapConsistent)Changes
Comparator.cs: Skip pairwise override whenPositionSwapConsistent == falseComparatorTests.cs: New testKeepsRubricScoresWhenPairwiseIsInconsistentInvestigatingResults.md: Updated pairwise inconsistency guidance to reflect new behaviorValidation
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com