[16.0] pdf content indexing: PyMupdf + Tesseract #431

len-foss · 2023-09-08T15:32:22Z

This PR integrates PyMuPDF to perform text extraction from PDFs.

The Tesseract OCR module has also been migrated from version 8. To support extraction on long documents, PyMuPDF is used to split the content into multiple images, because Tesseract has a limit on the image size it can process. I've also added the option to explicitly change Tesseract's base language with a context key.
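For readers who want a picture of the approach described above, here is a minimal sketch, not the code from this PR. It renders each PDF page to its own PNG with PyMuPDF and passes the pages to Tesseract one at a time, so no single image grows with the document length. The function names, the `ocr_lang` context key and the 300 dpi default are assumptions made for illustration; the module's real key name may differ.

```python
import io

DEFAULT_LANG = "eng"
# Hypothetical context key; the actual key used by the module is not shown here.
LANG_CONTEXT_KEY = "ocr_lang"


def resolve_ocr_lang(context, default=DEFAULT_LANG):
    """Return the Tesseract language found in a context dict, else the default."""
    return (context or {}).get(LANG_CONTEXT_KEY) or default


def ocr_pdf(pdf_bytes, context=None, dpi=300):
    """OCR a PDF page by page and return the concatenated text."""
    # Imported lazily so resolve_ocr_lang works without the OCR stack installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    lang = resolve_ocr_lang(context)
    texts = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            # One image per page keeps each image below Tesseract's size limit
            # for ordinary page formats.
            pix = page.get_pixmap(dpi=dpi)
            # PNG is used as the interchange format between PyMuPDF and Pillow.
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(texts)
```

Under these assumptions, a caller would switch the language with something like `ocr_pdf(data, context={"ocr_lang": "fra"})`, provided the matching Tesseract language pack is installed.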

It also adds a new module to perform OCR in individual jobs rather than in a cron.

[ADD] tests for attachments_to_filesystem

…text

This is more performant and makes it easy to split pages, which avoids errors caused by Tesseract's maximum image size.

len-foss · 2023-11-07T11:33:16Z

@agent-z28 FYI

github-actions · 2024-06-16T12:28:24Z

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

hbrunn and others added 13 commits September 7, 2023 11:09

[ADD] document_ocr

8d69a35

[FIX] CI

d3a37e5

[ADD] tests for attachments_to_filesystem

[ADD] cap the amount of documents to ocr per cronjob run

5ee5b3f

[FIX] ignore files with unknown mimetype

8ac1ca8

[FIX] use png as for pillow interchange

c10e84e

[IMP] document_ocr: handle invalid data in attachments gracefully

a00735b

[IMP] document_ocr: pre-commit execution

f1f13f1

[MIG] document_ocr -> attachment_indexation_ocr

5edbe16

[IMP] attachment_indexation_ocr: option to pass tesseract lang in con…

73cff87

…text

[IMP] attachment_indexation_ocr: convert pdf with fitz

6196f30

This is more performant and makes it easy to split pages, which avoids errors caused by Tesseract's maximum image size.

[REF] attachment_indexation_ocr: refactor test class for inheritance

ecadc83

[ADD] attachment_indexation_ocr_job

f31245a

[ADD] attachment_indexation_mupdf

7f394be

len-foss mentioned this pull request Sep 15, 2023

Migration to version 16.0 OCA/dms#213

Open

6 tasks

len-foss added 4 commits February 15, 2024 11:19

[UPD] requirements.txt: add dependency on textract

8d1125a

[ADD] attachment_indexation_textract

1125d29

[UPD] requirements.txt: use textract fork

8e98036

[FIX] attachment_indexation_textract: use a textract fork to fix pip

834573b

github-actions bot added the "stale" label (PR/Issue without recent activity, it'll be soon closed automatically) on Jun 16, 2024
