Hello everyone!
First of all, thank you for this project. I am using it in my RAG application and it is pretty cool!
Looking at the headers we send to the Tika server:
I see that we run pdfocr in both cases (do_ocr True or False). I would expect it to be deactivated when do_ocr = False, and to be at least optional when do_ocr = True.
I did a little experimentation: with do_ocr = True, I get 60 better time performance when I deactivate pdfocr, with no apparent loss in text extraction. Moreover, I can see that the text on images is extracted twice when both tikaocr and pdfocr are activated.
Wouldn't it be better to deactivate pdfocr by default?
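To make the suggestion concrete, here is a rough sketch of the behaviour I have in mind. It is not the project's current code: `build_tika_headers` and its `pdf_ocr` parameter are hypothetical names, and I am assuming that pdfocr corresponds to Tika server's `X-Tika-PDFOcrStrategy` header and that tikaocr relies on `X-Tika-PDFextractInlineImages` and `X-Tika-OCRskipOcr`. The actual headers sent by the project may differ.

```python
def build_tika_headers(do_ocr: bool, pdf_ocr: bool = False) -> dict:
    """Hypothetical helper: build the request headers sent to the Tika server.

    Assumption: `pdf_ocr` maps to the X-Tika-PDFOcrStrategy header and is
    off by default, so it has to be requested explicitly.
    """
    headers = {"Accept": "text/plain"}

    if not do_ocr:
        # No OCR at all: skip the OCR parser and disable PDF page OCR.
        headers["X-Tika-OCRskipOcr"] = "true"
        headers["X-Tika-PDFOcrStrategy"] = "no_ocr"
        return headers

    # OCR requested: extract inline images so they can be OCR'd once.
    headers["X-Tika-PDFextractInlineImages"] = "true"
    # Page-level PDF OCR only when explicitly enabled.
    headers["X-Tika-PDFOcrStrategy"] = (
        "ocr_and_text_extraction" if pdf_ocr else "no_ocr"
    )
    return headers
```

With this, `build_tika_headers(do_ocr=True)` would keep image OCR but skip page-level PDF OCR unless `pdf_ocr=True` is passed.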
Or maybe I am missing something?
Don't hesitate to ask if I wasn't clear. I'll be happy to contribute a fix if this is not the expected behaviour!
Paul