Tika server is running OCR twice ? #76

Open
le-codeur-rapide opened this issue Jul 18, 2024 · 0 comments


Hello everyone!
First of all, thank you for this project. I am using it in my RAG application and it is pretty cool!

Looking at the headers we send to the Tika server:

def parse_to_html(self, filepath, do_ocr=False):
    # Turn off OCR by default
    timeout = 3000
    headers = {
        "X-Tika-OCRskipOcr": "true"
    }
    if do_ocr:
        headers = {
            "X-Tika-OCRskipOcr": "false",
            "X-Tika-OCRoutputType": "hocr",
            "X-Tika-Timeout-Millis": str(100 * timeout),
            "X-Tika-PDFOcrStrategy": "ocr_only",
            "X-Tika-OCRtimeoutSeconds": str(timeout),
        }

    if ensure_bool(os.environ.get("TIKA_OCR", False)):
        headers = None
    return parser.from_file(filepath, xmlContent=True, requestOptions={'headers': headers, 'timeout': timeout})

I see that pdfOcr runs in both cases, whether do_ocr is True or False. I would expect it to be deactivated when do_ocr=False, and to be at least optional when do_ocr=True.
I only did a little experimentation, but with do_ocr=True I get 60 better time performance when I deactivate pdfOcr, without apparent loss in text extraction. Moreover, I can see that the text on images is extracted twice when both tikaocr and pdfOcr are activated.
Wouldn't it be better to deactivate pdfOcr by default?
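
To make the suggestion concrete, here is a rough sketch of how the headers could be built so that PDF OCR is off by default and opt-in otherwise. The helper name `build_tika_headers` and its `pdf_ocr_strategy` parameter are hypothetical (they are not in the project), and I am assuming that Tika accepts `no_ocr` as a value for `X-Tika-PDFOcrStrategy`; the other headers are copied from the function above.

```python
from typing import Dict, Optional


def build_tika_headers(
    do_ocr: bool = False,
    pdf_ocr_strategy: Optional[str] = None,
    timeout: int = 3000,
) -> Dict[str, str]:
    """Hypothetical helper: build request headers for the Tika server.

    PDF OCR is only requested when the caller passes a strategy explicitly.
    """
    if not do_ocr:
        return {
            "X-Tika-OCRskipOcr": "true",
            # Assumption: "no_ocr" is a valid PDF OCR strategy value in Tika.
            "X-Tika-PDFOcrStrategy": "no_ocr",
        }

    headers = {
        "X-Tika-OCRskipOcr": "false",
        "X-Tika-OCRoutputType": "hocr",
        "X-Tika-Timeout-Millis": str(100 * timeout),
        "X-Tika-OCRtimeoutSeconds": str(timeout),
    }
    # Only send a PDF OCR strategy when the caller opts in,
    # e.g. pdf_ocr_strategy="ocr_only" to reproduce the current behaviour.
    if pdf_ocr_strategy is not None:
        headers["X-Tika-PDFOcrStrategy"] = pdf_ocr_strategy
    return headers
```

With this, `build_tika_headers(do_ocr=True)` sends no PDF OCR strategy at all, while `build_tika_headers(do_ocr=True, pdf_ocr_strategy="ocr_only")` gives the same headers as today. I have not checked how the Tika server behaves when the strategy header is omitted, so that part would need testing.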

Or maybe I am missing something?

Don't hesitate to ask if I wasn't clear. I'll be happy to contribute if this is not the expected behaviour!

Paul
