DOC: Add note about scanned PDFs and OCR suggestion in extract_text.md #3387

Open
wants to merge 4 commits into
base: main
from

Conversation

cybercoded

This documentation update enhances docs/user/extract-text.md by adding a useful {tip} block for users encountering None from page.extract_text().

The new tip explains that if no text is extracted, the page might be a scanned image with no text layer. It also provides a short code snippet that checks for this case and suggests using OCR software like pytesseract.

This addresses a common pain point for users unfamiliar with the limitations of text extraction from image-based PDFs. No functional code changes are included; this is purely a documentation improvement.

Collaborator

@stefan6419846 stefan6419846 left a comment

Thanks for the PR. Could you please have a look at the test failure?

Additionally, we should probably move the note after the existing note. Besides the code being outside of the note, the note seems to be wrong as well: The type hints for page.extract_text() indicate that we always return a string and never None as stated in your proposal. I actually think that the code snippet is superfluous unless we are able to provide an actual reliable way to check for text operations independently.
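To illustrate the return-type point: because `page.extract_text()` is annotated as returning `str`, a check for missing text has to test for an empty or whitespace-only string rather than `None`. The following is a minimal sketch, not part of the PR. The helper names `has_extractable_text` and `pages_without_text` are hypothetical, and an empty result only says that no text was extracted, not that the page is a scan.

```python
def has_extractable_text(text: str) -> bool:
    """Return True if the extracted string has any non-whitespace character."""
    return bool(text.strip())


def pages_without_text(pdf_path: str) -> list[int]:
    """Return zero-based indices of pages whose extracted text is empty.

    Hypothetical helper: an empty result does not prove the page is a scan.
    """
    from pypdf import PdfReader  # imported lazily; requires pypdf to be installed

    reader = PdfReader(pdf_path)
    return [
        index
        for index, page in enumerate(reader.pages)
        if not has_extractable_text(page.extract_text())
    ]
```

Comparing against `None` would never flag a page here, which is why the original snippet in the proposal could not work as written.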

@cybercoded
Author

Thanks for the detailed feedback, very helpful!

You're absolutely right about the return type of extract_text() always being a string. I misunderstood that and will revise the note accordingly (and remove the None check suggestion since it’s not technically accurate).

I'll also move the note after the existing {note} block, as suggested.

Regarding the reliability of checking for scanned PDFs, I agree that without a concrete detection mechanism for text content (like checking operators in the content stream), including the example may be misleading. I’ll remove the code snippet unless we decide it's helpful to direct users toward OCR in a more general guidance section.
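For context on why such a check is only a heuristic, here is a rough sketch of what a best-effort test might look like. It is not proposed for the documentation. It assumes that a page with no extracted text but at least one embedded image is probably a scan; the function names `likely_scanned` and `page_is_likely_scanned` are hypothetical, and the approach gives wrong answers for pages that mix images with vector-drawn text or that carry a hidden OCR text layer.

```python
def likely_scanned(text: str, image_count: int) -> bool:
    """Heuristic only: no non-whitespace text and at least one embedded image."""
    return not text.strip() and image_count > 0


def page_is_likely_scanned(page) -> bool:
    """Apply the heuristic to a pypdf page object.

    Assumes `page.extract_text()` returns a str and `len(page.images)` gives
    the number of embedded images, as in recent pypdf releases.
    """
    return likely_scanned(page.extract_text(), len(page.images))
```

A page that is simply blank returns `False` here, and a scanned page that already has an OCR text layer also returns `False`, so the result should be treated as a hint rather than a detection mechanism.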

I’ll also look into the test failure and fix that before updating this PR.

Thanks again, I’ll submit an updated commit shortly.

@cybercoded cybercoded changed the title add tip for handling scanned PDFs in extract_text documentation docs: add note about scanned PDFs and OCR suggestion in extract_text.md Jul 21, 2025
@cybercoded cybercoded changed the title docs: add note about scanned PDFs and OCR suggestion in extract_text.md doc: add note about scanned PDFs and OCR suggestion in extract_text.md Jul 21, 2025
@cybercoded cybercoded changed the title doc: add note about scanned PDFs and OCR suggestion in extract_text.md DOC: add note about scanned PDFs and OCR suggestion in extract_text.md Jul 21, 2025
@stefan6419846 stefan6419846 changed the title DOC: add note about scanned PDFs and OCR suggestion in extract_text.md DOC: Add note about scanned PDFs and OCR suggestion in extract_text.md Jul 22, 2025
@@ -39,6 +39,9 @@ very often).
To limit the size of the content streams to process (and avoid OOM errors in your application), consider
checking `len(page.get_contents().get_data())` beforehand.
If a PDF page appears to contain only an image (e.g., a scanned document), the extracted text may be minimal or visually empty.
Collaborator

Please use a separate note instead. The content of both notes is unrelated and thus does not belong together.

Author

Thanks for the clarification!

I've now separated the OCR suggestion into its own {note} block, as advised, and kept it distinct from the memory-related note.

Let me know if there’s anything else you’d like adjusted.

2 participants