DOC: Add note about scanned PDFs and OCR suggestion in extract_text.md #3387
Thanks for the PR. Could you please have a look at the test failure?
Additionally, we should probably move the note after the existing note. Besides the code being outside of the note, the note itself seems to be wrong as well: the type hints for `page.extract_text()` indicate that we always return a string and never `None`, contrary to what your proposal states. I actually think that the code snippet is superfluous unless we are able to provide an actual reliable way to check for text operations independently.
Thanks for the detailed feedback, very helpful! You're absolutely right about the return type of `extract_text()` always being a string. I misunderstood that and will revise the note accordingly (and remove the `None` check suggestion since it's not technically accurate). I'll also move the note after the existing {note} block, as suggested.
Regarding the reliability of checking for scanned PDFs, I agree that without a concrete detection mechanism for text content (like checking operators in the content stream), including the example may be misleading. I'll remove the code snippet unless we decide it's helpful to direct users toward OCR in a more general guidance section.
I'll also look into the test failure and fix that before updating this PR. Thanks again, I'll submit an updated commit shortly.
@@ -39,6 +39,9 @@ very often).
To limit the size of the content streams to process (and avoid OOM errors in your application), consider
checking `len(page.get_contents().get_data())` beforehand.
If a PDF page appears to contain only an image (e.g., a scanned document), the extracted text may be minimal or visually empty.
Please use a separate note instead. The content of both notes is unrelated and thus does not belong together.
Thanks for the clarification!
I've now separated the OCR suggestion into its own {note} block, as advised, and kept it distinct from the memory-related note.
Let me know if there’s anything else you’d like adjusted.
This documentation update enhances `docs/user/extract-text.md` by adding a useful `{tip}` block for users encountering `None` from `page.extract_text()`. The new tip explains that if no text is extracted, the page might be a scanned image with no text layer. It also provides a short code snippet that checks for this case and suggests using OCR software like `pytesseract`. This addresses a common pain point for users unfamiliar with the limitations of text extraction from image-based PDFs. No functional code changes are included; this is purely a doc improvement.
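For illustration only, here is a minimal sketch of how a check could look once the review feedback is taken into account: since `page.extract_text()` returns a string and never `None`, the test is for empty or whitespace-only text rather than for `None`. The helper name `pages_without_text` and the `min_chars` threshold are hypothetical and not part of pypdf. This is a rough heuristic, not a reliable way to detect scanned pages, because a page can legitimately contain no text, and a scanned page can still carry a few stray text operations.

```python
from typing import Iterable, List


def pages_without_text(page_texts: Iterable[str], min_chars: int = 1) -> List[int]:
    """Return zero-based indices of pages whose extracted text is effectively empty.

    Hypothetical helper (not part of pypdf). `page_texts` holds one string per
    page, e.g. the results of calling `page.extract_text()` on each page.
    A page is flagged when its text, stripped of surrounding whitespace,
    has fewer than `min_chars` characters.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


# Possible usage with pypdf (assumes a local file named "example.pdf"):
#
#     from pypdf import PdfReader
#
#     reader = PdfReader("example.pdf")
#     texts = [page.extract_text() for page in reader.pages]
#     candidates = pages_without_text(texts)
#     # Pages listed in `candidates` may be image-only and could be passed
#     # to OCR software such as pytesseract.
```

Keeping the helper independent of pypdf means it only depends on the documented return type (a string), which is the point raised in the review.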