Skip to content

Fall back to rubric scores when pairwise judge is position-swap-inconsistent#475

Open
steveisok wants to merge 1 commit intomainfrom
fix/pairwise-fallback-scoring
Open

Fall back to rubric scores when pairwise judge is position-swap-inconsistent#475
steveisok wants to merge 1 commit intomainfrom
fix/pairwise-fallback-scoring

Conversation

@steveisok
Copy link
Copy Markdown
Member

Problem

When the pairwise judge is position-swap-inconsistent (disagrees with itself when A/B positions are swapped), MergeInconsistentResults defaults everything to tie, which PairwiseToQualityScore converts to 0. Since pairwise results override both QualityImprovement (40%) and OverallJudgmentImprovement (30%), this zeros out 70% of the weighted score.

For action-oriented skills that do real work (download symbols, run tools), rubric scoring may show genuine quality improvement (e.g., 2.3/5 → 3.0/5), but the pairwise comparison is noisy enough to flip on position swap. The result: quality credit gets discarded and the overhead penalties from doing valuable work push the score negative.

Concrete example

From the apple-crash-symbolication skill eval on PR #201:

  • Rubric scored MAUI scenario: baseline 2.3/5, skilled 3.0/5 (+0.7)
  • Pairwise judge: position-swap-inconsistent → forced to tie → quality = 0
  • Only the efficiency penalties remained → final score -1.4%
  • The skill's output actually symbolicated frames the baseline couldn't

Fix

Only override rubric-based scores when the pairwise comparison is position-swap-consistent. When inconsistent, keep the rubric-based quality signal — it's noisier than a consistent pairwise result, but non-zero.

One-line guard: if (pairwiseResult is not null && pairwiseResult.PositionSwapConsistent)

Changes

  • Comparator.cs: Skip pairwise override when PositionSwapConsistent == false
  • ComparatorTests.cs: New test KeepsRubricScoresWhenPairwiseIsInconsistent
  • InvestigatingResults.md: Updated pairwise inconsistency guidance to reflect new behavior

Validation

  • 497/497 tests pass
  • Ran apple-crash-symbolication eval before/after: MAUI scenario went from -0.014 to +0.217 on a run where pairwise happened to be consistent; the fix provides insurance for runs where it isn't

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

…istent

When the pairwise judge disagrees with itself across position swaps,
MergeInconsistentResults defaults everything to 'tie', which
PairwiseToQualityScore converts to 0. This zeroes out 70% of the
weighted score (Quality 40% + Overall 30%), leaving only the efficiency
penalties to determine the result.

For action-oriented skills that do real work (download symbols, run
tools), this means quality improvements measured by rubric scoring get
discarded, and the overhead of doing valuable work pushes the score
negative.

Fix: only override rubric-based scores when pairwise is consistent.
When inconsistent, keep the rubric-based quality signal — it's noisier
but non-zero.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 30, 2026 12:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts skill-validator scoring so position-swap-inconsistent pairwise judgments no longer zero out the quality signal, preserving rubric/independent-judge improvements when pairwise is unreliable.

Changes:

  • Comparator.CompareScenario now applies pairwise overrides only when PositionSwapConsistent == true.
  • Adds a regression test ensuring rubric-based scores are retained when pairwise is inconsistent.
  • Updates investigation guidance to reflect the new fallback behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
eng/skill-validator/src/Evaluate/Comparator.cs Guards pairwise override behind PositionSwapConsistent to avoid tie→0 overriding rubric/overall improvements.
eng/skill-validator/tests/Evaluate/ComparatorTests.cs Adds test coverage for the inconsistent-pairwise fallback behavior.
eng/skill-validator/InvestigatingResults.md Documents how position-swap inconsistency affects which quality signals are used.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if (pairwiseResult is not null)
// Override quality scores with pairwise results when consistent.
// When position-swap-inconsistent, keep the rubric-based scores —
// zeroing out 70% of the score on a noisy comparison is worse than
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment hard-codes “70% of the score”, which is derived from the current DefaultWeights (0.40 + 0.30). To avoid this drifting if weights change, consider rephrasing (e.g., “most of the weighted score”) or referencing DefaultWeights rather than embedding the percentage in prose.

Suggested change
// zeroing out 70% of the score on a noisy comparison is worse than
// zeroing out most of the weighted score on a noisy comparison is worse than

Copilot uses AI. Check for mistakes.
@ViktorHofer
Copy link
Copy Markdown
Member

Given this change, are we now entirely discarding the result from the pairwise judge when the position-swap is inconsistent? Isn't this a concern? AI tell me ;)

@steveisok
Copy link
Copy Markdown
Member Author

Given this change, are we now entirely discarding the result from the pairwise judge when the position-swap is inconsistent? Isn't this a concern? AI tell me ;)

Yes, because my read is that the judge contradicted itself. Also, calling it a "tie" and scoring 0 impacts 70% of the weighted score. That's... not great.

@ViktorHofer
Copy link
Copy Markdown
Member

@adityamandaleeka this is coming from your original implementation. Do you have a recommendation here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants