feat(eval): add abstention detection and comparison-aware scoring for 'static_json'#401

Open
jatinkumar300403 wants to merge 1 commit into IBM:main from jatinkumar300403:feature/comparison-abstention-policy

jatinkumar300403 commented Jun 21, 2026

Resolves issue #396.

Description

This PR establishes a consistent scoring policy for comparison-only scenarios in the static_json evaluator. It distinguishes between an agent that legitimately abstains from answering ("I don't know") and one that names the correct entities but answers in natural-language prose instead of the strict expected JSON format.

Before vs. After Comparison:

  • Before: If the gold answer was {"machine": "Motor_B", "severity": "Zone_D"}, an agent replying "I don't know the answer" and an agent replying "Motor_B is the priority because Zone_D" would BOTH receive a flat 0.0 score with a "structured answer differs" rationale.
  • After:
    • The "I don't know" agent now receives 0.0 with the abstained=True flag and an "agent abstained from answering" rationale.
    • The agent answering "Motor_B ... Zone_D" receives a 0.5 partial credit score with a "comparison match" rationale, recognizing it found the right entities but missed the strict structure.
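
The policy above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name score_static_json, the decline-phrase list, the case-insensitive substring match, the 1.0 score for an exact structured match, and the "structured answer matches" rationale are all assumptions. Only the 0.0 / 0.5 scores, the abstained flag, and the other rationale strings come from the description.

```python
import json
from dataclasses import dataclass

# Hypothetical cue list; the real is_abstained() helper may detect other phrases.
_DECLINE_PHRASES = ("i don't know", "i do not know", "cannot determine", "unable to answer")


@dataclass
class ScorerResult:
    score: float
    rationale: str
    abstained: bool = False


def is_abstained(answer: str) -> bool:
    """Treat empty or decline-to-answer responses as abstentions."""
    text = answer.strip().lower()
    return not text or any(phrase in text for phrase in _DECLINE_PHRASES)


def score_static_json(gold: dict[str, str], answer: str) -> ScorerResult:
    """Score one answer against the gold JSON object (illustrative only)."""
    if is_abstained(answer):
        return ScorerResult(0.0, "agent abstained from answering", abstained=True)
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        # Not JSON: grant partial credit if every gold value is mentioned in the prose.
        lowered = answer.lower()
        if all(str(value).lower() in lowered for value in gold.values()):
            return ScorerResult(0.5, "comparison match")
        return ScorerResult(0.0, "structured answer differs")
    if parsed == gold:
        # Assumed: an exact structured match earns full credit.
        return ScorerResult(1.0, "structured answer matches")
    return ScorerResult(0.0, "structured answer differs")
```

With the gold answer from the example, "I don't know the answer" yields a 0.0 score with abstained=True, while "Motor_B is the priority because Zone_D" yields 0.5. A prose answer naming the wrong entities still scores 0.0, but without the abstained flag.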

Type of Change

  • New Benchmark Scenario (Industry/Asset type)
  • Evaluation Metric / Scorer
  • Agentic Orchestration Logic (ReAct, Plan-Execute, etc.)
  • Infrastructure / Tooling Improvement

Industry Relevance

Improves evaluation accuracy for ambiguous or chat-based diagnostic scenarios where agents naturally fall back to prose instead of rigid JSON schemas, preventing false negatives in benchmarking.

Testing & Validation

  • Unit Tests: uv run pytest tests/unit passed. Added 3 test cases covering abstention, correct comparison, and wrong comparison.
  • Scenario Validation: Verified that the agent can execute the trajectory.
  • Data Integrity: Checked that no PII or sensitive industrial data is included.

Checklist

  • My code follows the project's Ruff formatting and linting rules.
  • I have performed a self-review of my code.
  • I have updated the documentation (README or /docs) accordingly.
  • I have signed off my commits (DCO).

… static_json

Add a consistent policy for comparison-only scenarios so the scorer can distinguish between an agent that refuses to answer (abstention) and one that provides a valid comparison in natural language instead of structured JSON.

Changes:

- Add abstained flag to ScorerResult for downstream filtering

- Introduce is_abstained() helper detecting empty or decline-to-answer responses

- Grant 0.5 partial credit when gold values appear in plain-text answers

- Add tests for abstention, correct comparison, and wrong comparison

Signed-off-by: jatinkumar300403 <jatin_johnny@yahoo.com>
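
The commit message says the abstained flag exists for downstream filtering. One way a consumer might use it is sketched below, under assumptions: the summarize helper and its output keys are hypothetical, and the ScorerResult fields other than abstained are guesses at the real class. The sketch reports the abstention rate separately from the mean score over answers the agent actually attempted.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScorerResult:
    # Only `abstained` is named in the PR; the other fields are assumed here.
    score: float
    rationale: str
    abstained: bool = False


def summarize(results: list[ScorerResult]) -> dict[str, float]:
    """Separate 'declined to answer' from 'answered incorrectly' in a summary."""
    attempted = [r for r in results if not r.abstained]
    abstained_count = len(results) - len(attempted)
    return {
        "abstention_rate": abstained_count / len(results) if results else 0.0,
        "mean_score_attempted": mean(r.score for r in attempted) if attempted else 0.0,
    }


results = [
    ScorerResult(0.0, "agent abstained from answering", abstained=True),
    ScorerResult(0.5, "comparison match"),
    ScorerResult(0.0, "structured answer differs"),
]
summary = summarize(results)
```

Without the flag, all three results would fold into a single mean and the abstention would be indistinguishable from a wrong answer.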