feat(eval): add abstention detection and comparison-aware scoring for 'static_json' by jatinkumar300403 · Pull Request #401 · IBM/AssetOpsBench

jatinkumar300403 · 2026-06-21T06:26:00Z

Resolves issue #396.

Description

This PR establishes a consistent scoring policy for comparison-only scenarios in the static_json evaluator. It distinguishes between an agent that legitimately abstains from answering ("I don't know") and one that successfully answers but uses natural language text instead of the strict expected JSON format.

Before vs. After Comparison:

Before: If the gold answer was {"machine": "Motor_B", "severity": "Zone_D"}, an agent replying "I don't know the answer" and an agent replying "Motor_B is the priority because Zone_D" would BOTH receive a flat 0.0 score with a "structured answer differs" rationale.
After:
- The "I don't know" agent now receives 0.0 with the abstained=True flag and an "agent abstained from answering" rationale.
- The agent answering "Motor_B ... Zone_D" receives a 0.5 partial credit score with a "comparison match" rationale, recognizing it found the right entities but missed the strict structure.
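The before/after behaviour above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the decline phrases, the function name score_static_json, the 1.0 score for an exact JSON match, and the substring-based matching are all assumptions made for the example. Only the abstained flag, the is_abstained() helper name, the 0.5 partial credit, and the rationale strings come from the description.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical decline phrases; the real list used by the PR is not shown here.
_DECLINE_PATTERNS = (
    r"\bi\s+do(?:n't| not)\s+know\b",
    r"\bcannot\s+(?:answer|determine)\b",
    r"\bunable\s+to\s+(?:answer|determine)\b",
)


@dataclass
class ScorerResult:
    score: float
    rationale: str
    abstained: bool = False


def is_abstained(response: str) -> bool:
    """Return True for empty responses or ones that decline to answer."""
    text = response.strip().lower()
    if not text:
        return True
    return any(re.search(pattern, text) for pattern in _DECLINE_PATTERNS)


def score_static_json(response: str, gold: dict) -> ScorerResult:
    """Score a response against a gold JSON object (illustrative policy only)."""
    if is_abstained(response):
        return ScorerResult(0.0, "agent abstained from answering", abstained=True)

    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        parsed = None  # response is prose, not JSON

    # Assumption: an exact structured match earns full credit.
    if parsed == gold:
        return ScorerResult(1.0, "structured answer matches")

    # Partial credit: every gold value appears somewhere in the plain text.
    lowered = response.lower()
    if parsed is None and all(str(v).lower() in lowered for v in gold.values()):
        return ScorerResult(0.5, "comparison match")

    return ScorerResult(0.0, "structured answer differs")
```

With the gold answer from the example, "I don't know the answer" yields 0.0 with abstained=True, while "Motor_B is the priority because Zone_D" yields 0.5. Substring matching is deliberately simple here; a real scorer would likely need stricter matching to avoid crediting a response that merely mentions every candidate.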

Type of Change

New Benchmark Scenario (Industry/Asset type)
Evaluation Metric / Scorer
Agentic Orchestration Logic (ReAct, Plan-Execute, etc.)
Infrastructure / Tooling Improvement

Industry Relevance

Improves evaluation accuracy for ambiguous or chat-based diagnostic scenarios where agents naturally fall back to prose instead of rigid JSON schemas, preventing false negatives in benchmarking.

Testing & Validation

Unit Tests: uv run pytest tests/unit passed. Added 3 test cases covering abstention, correct comparison, and wrong comparison.
Scenario Validation: Verified that the agent can execute the trajectory.
Data Integrity: Checked that no PII or sensitive industrial data is included.

Checklist

My code follows the project's Ruff formatting and linting rules.
I have performed a self-review of my code.
I have updated the documentation (README or /docs) accordingly.
I have signed off my commits (DCO).

… static_json

Add a consistent policy for comparison-only scenarios so the scorer can distinguish between an agent that refuses to answer (abstention) and one that provides a valid comparison in natural language instead of structured JSON.

Changes:
- Add abstained flag to ScorerResult for downstream filtering
- Introduce is_abstained() helper detecting empty or decline-to-answer responses
- Grant 0.5 partial credit when gold values appear in plain-text answers
- Add tests for abstention, correct comparison, and wrong comparison

Signed-off-by: jatinkumar300403 <jatin_johnny@yahoo.com>
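As an illustration of the "downstream filtering" the commit message mentions, the sketch below shows one way a report could separate abstentions from wrong answers. The summarize function and its output keys are hypothetical, and the ScorerResult fields other than abstained are assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class ScorerResult:
    # Field names other than `abstained` are assumptions for this sketch.
    score: float
    rationale: str
    abstained: bool = False


def summarize(results: list[ScorerResult]) -> dict[str, float]:
    """Report the abstention rate separately from accuracy on answered items."""
    total = len(results)
    answered = [r for r in results if not r.abstained]
    return {
        "abstention_rate": (total - len(answered)) / total if total else 0.0,
        "mean_score_overall": sum(r.score for r in results) / total if total else 0.0,
        "mean_score_answered": (
            sum(r.score for r in answered) / len(answered) if answered else 0.0
        ),
    }
```

Keeping the two means side by side makes it visible whether a low overall score comes from wrong answers or from the agent declining to answer.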

jatinkumar300403 wants to merge 1 commit into IBM:main from jatinkumar300403:feature/comparison-abstention-policy
