DOC: Add note about scanned PDFs and OCR suggestion in extract_text.md #3387

cybercoded · 2025-07-17T09:52:30Z

This documentation update enhances docs/user/extract-text.md by adding a useful {tip} block for users encountering None from page.extract_text().

The new tip explains that if no text is extracted, the page might be a scanned image with no text layer. It also provides a short code snippet that checks for this case and suggests using OCR software like pytesseract.

This addresses a common pain point for users unfamiliar with the limitations of text extraction from image-based PDFs. No functional code changes are included; this is purely a doc improvement.

stefan6419846

Thanks for the PR. Could you please have a look at the test failure?

Additionally, we should probably move the note after the existing note. Besides the code being outside of the note, the note seems to be wrong as well: The type hints for page.extract_text() indicate that we always return a string and never None as stated in your proposal. I actually think that the code snippet is superfluous unless we are able to provide an actual reliable way to check for text operations independently.
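
For illustration only, the return-type point above can be sketched as follows. Because `page.extract_text()` returns a string, a check for "no text" has to test for an empty or whitespace-only string rather than `None`. The helper names `has_visible_text` and `pages_without_text` are hypothetical and not part of pypdf, and a blank result is only a hint that a page may be image-only, not a reliable detection.

```python
from __future__ import annotations


def has_visible_text(text: str) -> bool:
    """Return True if the string contains at least one non-whitespace character."""
    return bool(text.strip())


def pages_without_text(pages) -> list[int]:
    """Return zero-based indices of pages whose extracted text is blank.

    `pages` is any iterable of objects that provide an `extract_text()`
    method returning a string (assumption: for pypdf this would be
    `PdfReader("example.pdf").pages`).
    """
    return [
        index
        for index, page in enumerate(pages)
        if not has_visible_text(page.extract_text())
    ]
```

A page listed by `pages_without_text` might be a scan, but it might equally be a blank page or one whose text cannot be decoded, so the result is best treated as a prompt to inspect the page or try OCR.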

cybercoded · 2025-07-21T12:16:06Z

Thanks for the detailed feedback, very helpful!

You're absolutely right about the return type of extract_text() always being a string. I misunderstood that and will revise the note accordingly (and remove the None check suggestion since it’s not technically accurate).

I'll also move the note after the existing {note} block, as suggested.

Regarding the reliability of checking for scanned PDFs, I agree that without a concrete detection mechanism for text content (like checking operators in the content stream), including the example may be misleading. I’ll remove the code snippet unless we decide it's helpful to direct users toward OCR in a more general guidance section.
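
As a rough sketch of what "checking operators in the content stream" could look like: the PDF text-showing operators are `Tj`, `TJ`, `'` and `"`, so one heuristic is to count them in a page's parsed operations. The function name `count_text_operations` is hypothetical, and the input shape (pairs of operands and an operator given as bytes) is an assumption modelled on the `operations` list of a pypdf `ContentStream`. This is not a reliable detector: text inside form XObjects is not seen, and a scanned page with an invisible OCR layer still contains these operators.

```python
# Operators that paint text according to the PDF specification.
TEXT_SHOWING_OPERATORS = frozenset({b"Tj", b"TJ", b"'", b'"'})


def count_text_operations(operations) -> int:
    """Count text-showing operators in an iterable of (operands, operator) pairs.

    Each operator is expected to be a bytes object such as b"Tj".
    """
    return sum(
        1 for _operands, operator in operations
        if operator in TEXT_SHOWING_OPERATORS
    )


# Hypothetical use with pypdf (assumption: get_contents() may return None):
#   contents = page.get_contents()
#   count = count_text_operations(contents.operations) if contents is not None else 0
```

A count of zero suggests that the page's own content stream draws no text, which fits an image-only page but does not prove it.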

I’ll also look into the test failure and fix that before updating this PR.

Thanks again, I’ll submit an updated commit shortly.

stefan6419846 · 2025-07-22T07:35:21Z

docs/user/extract-text.md

@@ -39,6 +39,9 @@ very often).

 To limit the size of the content streams to process (and avoid OOM errors in your application), consider
 checking `len(page.get_contents().get_data())` beforehand.
+
+If a PDF page appears to contain only an image (e.g., a scanned document), the extracted text may be minimal or visually empty.


Please use a separate note instead. The content of both notes is unrelated and thus does not belong together.

cybercoded · 2025-07-22

Thanks for the clarification!

I've now separated the OCR suggestion into its own {note} block, as advised, and kept it distinct from the memory-related note.

Let me know if there’s anything else you’d like adjusted.

cybercoded added 2 commits July 17, 2025 10:44

add tip for handling scanned PDFs in extract_text documentation

454e483

removed # from the front of the tip message

3c29a08

stefan6419846 requested changes Jul 17, 2025

cybercoded changed the title ~~add tip for handling scanned PDFs in extract_text documentation~~ docs: add note about scanned PDFs and OCR suggestion in extract_text.md Jul 21, 2025

cybercoded changed the title ~~docs: add note about scanned PDFs and OCR suggestion in extract_text.md~~ doc: add note about scanned PDFs and OCR suggestion in extract_text.md Jul 21, 2025

cybercoded changed the title ~~doc: add note about scanned PDFs and OCR suggestion in extract_text.md~~ DOC: add note about scanned PDFs and OCR suggestion in extract_text.md Jul 21, 2025

DOC: revise note about scanned PDFs and remove incorrect None check

b293342

stefan6419846 changed the title ~~DOC: add note about scanned PDFs and OCR suggestion in extract_text.md~~ DOC: Add note about scanned PDFs and OCR suggestion in extract_text.md Jul 22, 2025

stefan6419846 reviewed Jul 22, 2025

docs: add separate OCR note for scanned image PDFs in extract_text.md

fc2bc2d
