I have some .pdf files where OCR of the embedded graphics works perfectly and the recognized text is displayed correctly in the OCR tab of the search results, but I cannot find this text or its contents in the search index itself.
Does anyone have an idea why the OCR text does not appear in the search index?
The extracted text tab only contains very poorly recognized text, e.g.
"tems Ltg Am Rohiance 3 5S300 WetterCar"
"Invoice 12345 6 AV"
In the OCR tab the text is correctly recognized:
"Car Systems Ltd Am Rohlande 3 58300 Wetter"
"Invoice 123456 /W"
A search for "123456", for example, returns no results. I'm a bit at a loss right now.
Hi there,
OSS puts the filename and metadata directly into the index, but leaves the OCR data to be added later. That's done by Apache Tika.
Try indexing single files manually from the command line to see whether Tika returns an HTTP Error 500.
In my experience, the service hangs when it has to process too much at once.
Also, when disk space is low, it stops adding OCR text.
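If you want to rule out Tika itself, you could also send one PDF straight to the Tika server and look at the HTTP status. This is only a rough sketch: it assumes a Tika Server listening on its default port 9998 on localhost (your installation may use a different host or port), and `tika_headers` and `extract_text` are helper names made up for this example. The `X-Tika-OCRLanguage` header passes the Tesseract language to Tika Server; `deu` is my guess for these German documents.

```python
import urllib.error
import urllib.request

# Assumption: Tika Server on its default port; adjust to your installation.
TIKA_URL = "http://localhost:9998/tika"


def tika_headers(ocr_language="eng"):
    """Headers for a plain-text extraction request with a chosen Tesseract language."""
    return {
        "Accept": "text/plain",
        "Content-Type": "application/pdf",
        "X-Tika-OCRLanguage": ocr_language,  # e.g. "deu" for German documents
    }


def extract_text(pdf_path, url=TIKA_URL, ocr_language="eng", timeout=300):
    """Send one PDF to Tika and return (status, text).

    An HTTP error such as 500 is returned as the status instead of being raised,
    so a failing document is easy to spot. A connection failure still raises.
    """
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        url, data=body, headers=tika_headers(ocr_language), method="PUT"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling `extract_text("invoice.pdf", ocr_language="deu")` (the file name is a placeholder) should give you status 200 and the extracted text if Tika is healthy. A 500 or a timeout on a single small file points at the Tika service rather than at the index.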
Furthermore, there is a parameter somewhere that lets you disable the second OCR pass if you already have a better-calibrated OCR solution upstream; it then uses the text of the original OCRed PDF.
By default it uses Google's Tesseract with the English language.
Make sure you set the OCR language to the language of your document content.
I use Chronoscan with a mix of Tesseract and Nuance.
It avoids unnecessary tokenization (the extra spaces).
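To illustrate why the extra spaces matter for your "123456" search, here is a small toy example. It uses a plain whitespace split as a stand-in for the index tokenizer, which is a simplification (the analyzers in the real index do more than that), and `whitespace_tokens` and `matches` are names invented for the example. The two strings are the ones from your post.

```python
def whitespace_tokens(text):
    """Lowercase and split on whitespace, a rough stand-in for an index tokenizer."""
    return text.lower().split()


def matches(query, text):
    """True if the query is exactly one of the tokens of the text."""
    return query.lower() in whitespace_tokens(text)


extracted = "Invoice 12345 6 AV"  # from the extracted text tab
ocr = "Invoice 123456 /W"         # from the OCR tab

print(matches("123456", extracted))  # False: the number was split into "12345" and "6"
print(matches("123456", ocr))        # True
```

So if only the poorly recognized text reaches the index, a search for the full invoice number finds nothing, even though the OCR tab shows it correctly.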
Best regards
Andy