fix(ocr): keep scanned/image-only marker for mixed scanned+failed PDFs #62
Merged
A scanned/image-only PDF that also had a page fail to extract took the page-failure error branch, which omits the "scanned/image-only" marker the Dart layer keys on, so OCR guidance was silently dropped for a recoverable document (Finding #1).

extract_text_from_pdf now picks the marker via should_include_scanned_marker: keep it whenever at least one page parsed successfully (a readable but text-less page is the scanned signature OCR recovers), and withhold it only when every page failed. The classification is intentionally count-based and recall-favoring, so a scanned PDF with a stray corrupt page is never silently dropped.

The below-threshold message construction is extracted into below_threshold_error_message, a pure function, so the exact wording the Dart classifier depends on is unit-testable without a PDF fixture. Pure-scanned and zero-page behavior is unchanged.

Tests:
- test_should_include_scanned_marker covers the predicate boundaries.
- test_below_threshold_error_message pins the exact emitted strings for pure-scanned, mixed (1-of-5, 3-of-5), and all-failed cases, tying the Rust producer to the Dart matcher contract so the format cannot drift undetected.
- Dart: rewrote the stale negative test to a genuine all-failed case and added a mixed scanned+corrupt regression test.
- The ignored fixture test now also asserts the scanned/image-only marker.
Summary
Fixes Finding #1 from the OCR-detection review: a scanned/image-only PDF that also had a page fail to extract took the page-failure error branch, which omits the scanned/image-only marker the Dart layer keys on, so OCR guidance was silently dropped for a recoverable document.
The fix
extract_text_from_pdf now decides the marker via a small predicate, should_include_scanned_marker: the marker is kept whenever at least one page parsed successfully (a readable but text-less page is the scanned signature OCR recovers) and withheld only when every page failed.
The classification is intentionally count-based and recall-favoring: we never silently drop a potentially recoverable scanned PDF just because one page is corrupt. Pure-scanned and degenerate zero-page behavior is unchanged.
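As a rough illustration of the count-based rule described above, the predicate could look like the sketch below. The parameter names and the (failed, total) signature are assumptions made for illustration, not the merged code. The zero-page result is also an assumption: the description only says that behavior is unchanged.

```rust
/// Hypothetical sketch of the predicate; the real signature may differ.
/// Returns true when the scanned/image-only marker should be kept.
fn should_include_scanned_marker(failed_pages: usize, total_pages: usize) -> bool {
    // No failures at all: the pre-existing pure-scanned path keeps the marker.
    // This also covers the zero-page case (assumed here to keep the marker).
    if failed_pages == 0 {
        return true;
    }
    // Some pages failed: keep the marker as long as at least one page
    // parsed successfully, and withhold it only when every page failed.
    failed_pages < total_pages
}
```

Because the rule only compares counts, a five-page scan with one corrupt page still gets the marker, while a document where all pages failed does not.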
The below-threshold message construction is extracted into below_threshold_error_message, a pure function, so the exact wording the Dart classifier depends on is unit-testable without a PDF fixture.
Messages emitted (below threshold)
- Pure-scanned: …; PDF may be scanned/image-only
- Mixed: …; N of M page(s) failed to extract (pages: […]); PDF may be scanned/image-only
- All-failed: …; N of N page(s) failed to extract (pages: […])
Dart isOcrRequiredPdfExtractionError is unchanged: it already keys on the marker substring, which Rust now emits for the mixed case.
Tests
- test_should_include_scanned_marker: predicate boundaries (single-page, multi-page, mixed ratios, zero-page, all-failed).
- test_below_threshold_error_message: pins the exact emitted strings for pure-scanned, mixed (1-of-5, 3-of-5), and all-failed, tying the Rust producer to the Dart matcher contract so the format can't drift undetected across the FFI boundary.
- The #[ignore]'d scanned-fixture test now also asserts the scanned/image-only marker.
Verification
- cargo test --lib -- --test-threads=1 → 147 passed / 0 failed / 10 ignored.
- flutter test test/unit → 59/59 passed; flutter analyze lib/ test/ → clean.
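For readers who want to see how a pure message builder makes the below-threshold wording testable without a PDF fixture, here is a minimal sketch. It is not the merged implementation. The `base` parameter stands in for the leading part of the message that the description elides, the function signatures are hypothetical, and rendering the page list with Rust's debug format is an assumption about the bracketed list.

```rust
/// Marker substring the Dart classifier is described as keying on.
const SCANNED_MARKER: &str = "scanned/image-only";

/// Hypothetical predicate: keep the marker unless every page failed.
fn should_include_scanned_marker(failed_pages: usize, total_pages: usize) -> bool {
    failed_pages == 0 || failed_pages < total_pages
}

/// Hypothetical sketch of the pure message builder.
/// `base` is a placeholder for the elided leading text of the real message.
fn below_threshold_error_message(
    base: &str,
    failed_pages: &[usize],
    total_pages: usize,
) -> String {
    let mut msg = String::from(base);
    // Report page failures only when there are any.
    if !failed_pages.is_empty() {
        msg.push_str(&format!(
            "; {} of {} page(s) failed to extract (pages: {:?})",
            failed_pages.len(),
            total_pages,
            failed_pages
        ));
    }
    // Append the marker for pure-scanned and mixed documents only.
    if should_include_scanned_marker(failed_pages.len(), total_pages) {
        msg.push_str(&format!("; PDF may be {}", SCANNED_MARKER));
    }
    msg
}
```

With a builder of this shape, a unit test can compare whole strings for the pure-scanned, mixed, and all-failed cases, and a substring check on the marker mirrors what the Dart side is described as doing.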