fix(ocr): keep scanned/image-only marker for mixed scanned+failed PDFs #62

Merged: dev07060 merged 1 commit into main from fix/ocr-mixed-pdf-regression on May 29, 2026

Conversation

@dev07060

Owner

Summary

Fixes Finding #1 from the OCR-detection review: a scanned/image-only PDF that also had a page fail to extract took the page-failure error branch, which omits the scanned/image-only marker the Dart layer keys on — so OCR guidance was silently dropped for a recoverable document.

The fix

extract_text_from_pdf now decides the marker via a small predicate, should_include_scanned_marker:

  • keep the marker whenever at least one page parsed successfully (a readable-but-text-less page is the scanned signature OCR can recover), and
  • withhold it only when every page failed to parse (genuine corruption — OCR cannot help).

The classification is intentionally count-based and recall-favoring: we never silently drop a potentially recoverable scanned PDF just because one page is corrupt. Pure-scanned and degenerate zero-page behavior is unchanged.
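
A minimal sketch of the count-based rule described above. The signature (total and failed page counts as `usize`) is an assumption, not the actual one in the codebase, and the zero-page case is treated like the pure-scanned case here on the reading that "unchanged" means the marker is still kept when no page failed.

```rust
/// Hypothetical signature: the real predicate may take different arguments.
/// Returns true when the scanned/image-only marker should be kept.
fn should_include_scanned_marker(total_pages: usize, failed_pages: usize) -> bool {
    // No failures at all: pure-scanned (and, by assumption, zero-page) input
    // keeps the marker, as before the fix.
    if failed_pages == 0 {
        return true;
    }
    // Some failures: keep the marker only if at least one page still parsed.
    failed_pages < total_pages
}
```

Under these assumptions a 5-page PDF with 1 or 3 failed pages keeps the marker, while a PDF whose pages all failed does not.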

The below-threshold message construction is extracted into below_threshold_error_message, a pure function, so the exact wording the Dart classifier depends on is unit-testable without a PDF fixture.

Messages emitted (below threshold)

| Case | Message | Contains marker? |
| --- | --- | --- |
| Pure scanned (0 failed) | …; PDF may be scanned/image-only | Yes |
| Mixed (some parsed, some failed) | …; N of M page(s) failed to extract (pages: […]); PDF may be scanned/image-only | Yes |
| All pages failed | …; N of N page(s) failed to extract (pages: […]) | No |
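
The three message shapes above could be produced by a pure function along these lines. This is a sketch, not the PR's code: the parameter list is hypothetical, the leading part of the message (shown as … above) is passed in as `prefix` rather than guessed, and the page list is rendered with Rust's `Debug` formatting (for example `[1, 2]`), which is an assumption about the real format.

```rust
/// Marker text as it appears in the messages above.
const SCANNED_MARKER: &str = "PDF may be scanned/image-only";

/// Hypothetical signature. `prefix` stands in for the elided start of the
/// message; `failed_pages` lists the page numbers that failed to extract.
fn below_threshold_error_message(
    prefix: &str,
    total_pages: usize,
    failed_pages: &[usize],
) -> String {
    let mut message = String::from(prefix);

    // Report page failures only when there are any.
    if !failed_pages.is_empty() {
        message.push_str(&format!(
            "; {} of {} page(s) failed to extract (pages: {:?})",
            failed_pages.len(),
            total_pages,
            failed_pages
        ));
    }

    // Keep the marker unless every page failed to parse.
    let include_marker = failed_pages.is_empty() || failed_pages.len() < total_pages;
    if include_marker {
        message.push_str("; ");
        message.push_str(SCANNED_MARKER);
    }

    message
}
```

Because the function takes only counts and page numbers, a unit test can pin each output string without loading a PDF fixture.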

Dart isOcrRequiredPdfExtractionError is unchanged — it already keys on the marker substring, which Rust now emits for the mixed case.
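
The contract the Dart matcher relies on amounts to a substring check. The sketch below restates it in Rust for consistency with the producer side; the function name is hypothetical and it is not the Dart implementation, which this PR leaves untouched.

```rust
/// Substring the consumer keys on, per the description above.
const MARKER_SUBSTRING: &str = "scanned/image-only";

/// Hypothetical Rust analogue of the Dart-side check: an extraction error
/// is treated as OCR-recoverable when it carries the marker substring.
fn contains_ocr_marker(error_message: &str) -> bool {
    error_message.contains(MARKER_SUBSTRING)
}
```

With this check, the mixed-case message is classified as OCR-recoverable and the all-failed message is not, with no change needed on the consumer side.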

Tests

  • test_should_include_scanned_marker — predicate boundaries (single-page, multi-page, mixed ratios, zero-page, all-failed).
  • test_below_threshold_error_message — pins the exact emitted strings for pure-scanned, mixed (1-of-5, 3-of-5), and all-failed, tying the Rust producer to the Dart matcher contract so the format can't drift undetected across the FFI boundary.
  • Dart: rewrote the stale negative test to a genuine all-failed case; added a mixed scanned+corrupt regression test.
  • The #[ignore]'d scanned-fixture test now also asserts the scanned/image-only marker.

Verification

  • Rust: cargo test --lib -- --test-threads=1 → 147 passed / 0 failed / 10 ignored.
  • Dart: flutter test test/unit → 59/59 passed; flutter analyze lib/ test/ → clean.

A scanned/image-only PDF that also had a page fail to extract took the
page-failure error branch, which omits the "scanned/image-only" marker the
Dart layer keys on — so OCR guidance was silently dropped for a recoverable
document (Finding #1).

extract_text_from_pdf now picks the marker via should_include_scanned_marker:
keep it whenever at least one page parsed successfully (a readable but
text-less page is the scanned signature OCR recovers), and withhold it only
when every page failed. The classification is intentionally count-based and
recall-favoring, so a scanned PDF with a stray corrupt page is never silently
dropped. The below-threshold message construction is extracted into
below_threshold_error_message, a pure function, so the exact wording the Dart
classifier depends on is unit-testable without a PDF fixture. Pure-scanned and
zero-page behavior is unchanged.

Tests:
- test_should_include_scanned_marker covers the predicate boundaries.
- test_below_threshold_error_message pins the exact emitted strings for
  pure-scanned, mixed (1-of-5, 3-of-5), and all-failed cases, tying the Rust
  producer to the Dart matcher contract so the format cannot drift undetected.
- Dart: rewrote the stale negative test to a genuine all-failed case and added
  a mixed scanned+corrupt regression test.
- The ignored fixture test now also asserts the scanned/image-only marker.

dev07060 merged commit de71467 into main on May 29, 2026
6 checks passed
dev07060 self-assigned this on May 30, 2026