[16.0] pdf content indexing: PyMupdf + Tesseract #431

Open
wants to merge 17 commits into
base: 16.0
Conversation

len-foss
Contributor

@len-foss len-foss commented Sep 8, 2023

This PR integrates PyMuPDF to perform text extraction.

The OCR with Tesseract has also been migrated from version 8. To make extraction possible on long documents, PyMuPDF is used to split the content into multiple images, since Tesseract has a limit on the image size it can process. I've also added the possibility to explicitly change Tesseract's base language with a context key.

It also adds a new module to perform OCR in individual jobs rather than in a cron.
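
For readers unfamiliar with the approach, here is a minimal sketch of the splitting idea described above. It is not the code from this PR. It assumes the `pytesseract` and `Pillow` packages as the bridge to Tesseract, assumes that pages are cut into horizontal bands, and assumes that OCR is only used when a page has no embedded text. The names `split_into_bands`, `extract_text`, `max_band_height` and `dpi` are hypothetical, and `lang` merely stands in for the language the PR reads from a context key.

```python
import io


def split_into_bands(page_height, max_band_height):
    """Return (top, bottom) pairs that cover [0, page_height] in order."""
    if page_height <= 0 or max_band_height <= 0:
        raise ValueError("heights must be positive")
    bands = []
    top = 0.0
    while top < page_height:
        # The last band is clamped so it never runs past the page.
        bottom = min(top + max_band_height, page_height)
        bands.append((top, bottom))
        top = bottom
    return bands


def extract_text(pdf_path, lang="eng", max_band_height=1000.0, dpi=150):
    """Use embedded text when present, otherwise OCR the page band by band."""
    # Imported lazily so the band helper works without the optional packages.
    import fitz  # PyMuPDF
    import pytesseract  # assumption: the PR may call Tesseract differently
    from PIL import Image

    chunks = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if text.strip():
                chunks.append(text)
                continue
            rect = page.rect
            for top, bottom in split_into_bands(rect.height, max_band_height):
                # Render only this band so each image stays small.
                clip = fitz.Rect(rect.x0, rect.y0 + top, rect.x1, rect.y0 + bottom)
                pix = page.get_pixmap(dpi=dpi, clip=clip)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                chunks.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(chunks)
```

Cutting at fixed heights can slice through a line of text, so a real implementation would likely overlap the bands or cut at whitespace. The band height and DPI here are placeholder values, not Tesseract's actual limits.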

@len-foss len-foss mentioned this pull request Sep 15, 2023
6 tasks
@len-foss
Contributor Author

len-foss commented Nov 7, 2023

@agent-z28 FYI

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

@github-actions github-actions bot added the "stale" label (PR/Issue without recent activity, it'll be soon closed automatically.) Jun 16, 2024