Fall back to rubric scores when pairwise judge is position-swap-inconsistent by steveisok · Pull Request #475 · dotnet/skills

steveisok · 2026-03-30T12:51:40Z

Problem

When the pairwise judge is position-swap-inconsistent (disagrees with itself when A/B positions are swapped), MergeInconsistentResults defaults everything to tie, which PairwiseToQualityScore converts to 0. Since pairwise results override both QualityImprovement (40%) and OverallJudgmentImprovement (30%), this zeros out 70% of the weighted score.

For action-oriented skills that do real work (download symbols, run tools), rubric scoring may show genuine quality improvement (e.g., 2.3/5 → 3.0/5), but the pairwise comparison is noisy enough to flip on position swap. The result: quality credit gets discarded and the overhead penalties from doing valuable work push the score negative.

Concrete example

From the apple-crash-symbolication skill eval on PR #201:

Rubric scored MAUI scenario: baseline 2.3/5, skilled 3.0/5 (+0.7)
Pairwise judge: position-swap-inconsistent → forced to tie → quality = 0
Only the efficiency penalties remained → final score -1.4%
The skill's output actually symbolicated frames the baseline couldn't

Fix

Only override rubric-based scores when the pairwise comparison is position-swap-consistent. When inconsistent, keep the rubric-based quality signal — it's noisier than a consistent pairwise result, but non-zero.

One-line guard: if (pairwiseResult is not null && pairwiseResult.PositionSwapConsistent)

Changes

Comparator.cs: Skip pairwise override when PositionSwapConsistent == false
ComparatorTests.cs: New test KeepsRubricScoresWhenPairwiseIsInconsistent
InvestigatingResults.md: Updated pairwise inconsistency guidance to reflect new behavior

Validation

497/497 tests pass
Ran apple-crash-symbolication eval before/after: MAUI scenario went from -0.014 to +0.217 on a run where pairwise happened to be consistent; the fix provides insurance for runs where it isn't

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

…istent When the pairwise judge disagrees with itself across position swaps, MergeInconsistentResults defaults everything to 'tie', which PairwiseToQualityScore converts to 0. This zeroes out 70% of the weighted score (Quality 40% + Overall 30%), leaving only the efficiency penalties to determine the result. For action-oriented skills that do real work (download symbols, run tools), this means quality improvements measured by rubric scoring get discarded, and the overhead of doing valuable work pushes the score negative. Fix: only override rubric-based scores when pairwise is consistent. When inconsistent, keep the rubric-based quality signal — it's noisier but non-zero. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adjusts skill-validator scoring so position-swap-inconsistent pairwise judgments no longer zero out the quality signal, preserving rubric/independent-judge improvements when pairwise is unreliable.

Changes:

Comparator.CompareScenario now applies pairwise overrides only when PositionSwapConsistent == true.
Adds a regression test ensuring rubric-based scores are retained when pairwise is inconsistent.
Updates investigation guidance to reflect the new fallback behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
eng/skill-validator/src/Evaluate/Comparator.cs	Guards pairwise override behind `PositionSwapConsistent` to avoid tie→0 overriding rubric/overall improvements.
eng/skill-validator/tests/Evaluate/ComparatorTests.cs	Adds test coverage for the inconsistent-pairwise fallback behavior.
eng/skill-validator/InvestigatingResults.md	Documents how position-swap inconsistency affects which quality signals are used.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-30T12:54:37Z

-        if (pairwiseResult is not null)
+        // Override quality scores with pairwise results when consistent.
+        // When position-swap-inconsistent, keep the rubric-based scores —
+        // zeroing out 70% of the score on a noisy comparison is worse than


The comment hard-codes “70% of the score”, which is derived from the current DefaultWeights (0.40 + 0.30). To avoid this drifting if weights change, consider rephrasing (e.g., “most of the weighted score”) or referencing DefaultWeights rather than embedding the percentage in prose.

Suggested change

// zeroing out 70% of the score on a noisy comparison is worse than

// zeroing out most of the weighted score on a noisy comparison is worse than

ViktorHofer · 2026-03-30T13:10:55Z

Given this change, are we now entirely discarding the result from the pairwise judge when the position-swap is inconsistent? Isn't this a concern? AI tell me ;)

steveisok · 2026-03-30T13:25:32Z

Given this change, are we now entirely discarding the result from the pairwise judge when the position-swap is inconsistent? Isn't this a concern? AI tell me ;)

Yes, because my read is that the judge contradicted itself. Also, calling it a "tie" and scoring 0 impacts 70% of the weighted score. That's... not great.

ViktorHofer · 2026-03-30T13:28:17Z

@adityamandaleeka this is coming from your original implementation. Do you have a recommendation here?

steveisok requested review from JanKrivanek and ViktorHofer as code owners March 30, 2026 12:51

Copilot AI review requested due to automatic review settings March 30, 2026 12:51

Copilot started reviewing on behalf of steveisok March 30, 2026 12:52 View session

Copilot AI reviewed Mar 30, 2026

View reviewed changes

ViktorHofer added the infrastructure label Mar 30, 2026

github-actions bot mentioned this pull request Apr 7, 2026

🏥 Repository Health Dashboard #288

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fall back to rubric scores when pairwise judge is position-swap-inconsistent#475

Fall back to rubric scores when pairwise judge is position-swap-inconsistent#475
steveisok wants to merge 1 commit intomainfrom
fix/pairwise-fallback-scoring

steveisok commented Mar 30, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 30, 2026

Uh oh!

ViktorHofer commented Mar 30, 2026

Uh oh!

steveisok commented Mar 30, 2026

Uh oh!

ViktorHofer commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	// zeroing out 70% of the score on a noisy comparison is worse than
	// zeroing out most of the weighted score on a noisy comparison is worse than

Conversation

steveisok commented Mar 30, 2026

Problem

Concrete example

Fix

Changes

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

ViktorHofer commented Mar 30, 2026

Uh oh!

steveisok commented Mar 30, 2026

Uh oh!

ViktorHofer commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants