feat(eval): add abstention detection and comparison-aware scoring for 'static_json' #401
Open
jatinkumar300403 wants to merge 1 commit into
… static_json

Add a consistent policy for comparison-only scenarios so the scorer can distinguish between an agent that refuses to answer (abstention) and one that provides a valid comparison in natural language instead of structured JSON.

Changes:
- Add abstained flag to ScorerResult for downstream filtering
- Introduce is_abstained() helper detecting empty or decline-to-answer responses
- Grant 0.5 partial credit when gold values appear in plain-text answers
- Add tests for abstention, correct comparison, and wrong comparison

Signed-off-by: jatinkumar300403 <jatin_johnny@yahoo.com>
Resolves issue #396.
Description
This PR establishes a consistent scoring policy for comparison-only scenarios in the static_json evaluator. It distinguishes between an agent that legitimately abstains from answering ("I don't know") and one that answers correctly but in natural-language text instead of the strictly expected JSON format.
Before vs. After Comparison:
- Before: for a gold answer of {"machine": "Motor_B", "severity": "Zone_D"}, an agent replying "I don't know the answer" and an agent replying "Motor_B is the priority because Zone_D" would both receive a flat 0.0 score with a "structured answer differs" rationale.
- After: the "I don't know" agent now receives 0.0 with the abstained=True flag and an "agent abstained from answering" rationale. The "Motor_B ... Zone_D" agent receives a 0.5 partial-credit score with a "comparison match" rationale, recognizing that it found the right entities but missed the strict structure.
Type of Change
Industry Relevance
Improves evaluation accuracy for ambiguous or chat-based diagnostic scenarios where agents naturally fall back to prose instead of rigid JSON schemas, preventing false negatives in benchmarking.
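For reviewers who want the policy from the Description in concrete form, here is a minimal sketch of how the three outcomes could be separated. It is not the code in this PR: ScorerResult, is_abstained(), the abstained flag, and the rationale strings come from the description, while score_static_json, the DECLINE_PHRASES list, the substring matching, and the 1.0 score for an exact JSON match are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Hypothetical decline phrases; the PR does not list the ones it checks.
DECLINE_PHRASES = ("i don't know", "i do not know", "cannot answer", "unable to answer")


@dataclass
class ScorerResult:
    score: float
    rationale: str
    abstained: bool = False  # lets downstream code filter out abstentions


def is_abstained(answer: str) -> bool:
    """Return True for empty or decline-to-answer responses."""
    text = answer.strip().lower()
    return not text or any(phrase in text for phrase in DECLINE_PHRASES)


def score_static_json(gold: dict, answer: str) -> ScorerResult:
    """Hypothetical scorer: exact JSON match, abstention, prose match, or miss."""
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        parsed = None  # not JSON, so fall through to the text-based checks

    if parsed == gold:
        # Assumed full-credit case; the PR text does not state this score.
        return ScorerResult(1.0, "structured answer matches")
    if is_abstained(answer):
        return ScorerResult(0.0, "agent abstained from answering", abstained=True)

    # Partial credit when every gold value appears somewhere in the answer text.
    lowered = answer.lower()
    if all(str(value).lower() in lowered for value in gold.values()):
        return ScorerResult(0.5, "comparison match")
    return ScorerResult(0.0, "structured answer differs")


gold = {"machine": "Motor_B", "severity": "Zone_D"}
abstain = score_static_json(gold, "I don't know the answer")
prose = score_static_json(gold, "Motor_B is the priority because Zone_D")
wrong = score_static_json(gold, "Motor_A is the priority because Zone_A")
```

In this sketch the abstention check runs before the plain-text match, so a reply that declines to answer is flagged even if it happens to mention a gold value. Plain substring matching is deliberately naive; a real implementation would likely need token-level matching to avoid crediting partial identifiers.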
Testing & Validation
uv run pytest tests/unit passed. (Added 3 specific test cases for abstention, correct comparison, and wrong comparison.)
Checklist