Language-agnostic OCR #640

pavel-denisov-fraunhofer · 2024-12-20T10:48:29Z

I implemented a modification of TesseractOcrModel that handles cases where the document language is not known in advance. It uses Tesseract's script detection to identify the script, then runs the matching script OCR model (e.g. "Latin" for English or German). Would you be interested in this feature in Docling? If so, I could prepare a PR.
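
To make the idea concrete, here is a minimal sketch of the two-step flow using the `tesseract` command line rather than Docling's TesseractOcrModel. It is not the actual modification. The helper names (`parse_osd`, `choose_lang`, `ocr_unknown_language`), the `eng` fallback, and the confidence threshold are all hypothetical. It also assumes that the script models are installed under `tessdata/script/` (so they are selected as `script/Latin` and so on), that `osd.traineddata` is available, and that `--psm 0` with `stdout` as the output base prints the OSD report to standard output.

```python
import subprocess

# Assumption: script models live under tessdata/script/, so the language
# argument for the Latin script model is "script/Latin".
SCRIPT_PREFIX = "script/"


def parse_osd(osd_text: str) -> dict:
    """Parse the 'Key: value' lines of a Tesseract OSD report into a dict."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def choose_lang(osd_text: str, fallback: str = "eng", min_conf: float = 0.0) -> str:
    """Return a script model name, or the fallback if detection is unusable."""
    fields = parse_osd(osd_text)
    script = fields.get("Script")
    try:
        conf = float(fields.get("Script confidence", "nan"))
    except ValueError:
        conf = float("nan")
    # NaN compares false, so a missing confidence also triggers the fallback.
    if not script or not conf >= min_conf:
        return fallback
    return SCRIPT_PREFIX + script


def ocr_unknown_language(image_path: str, fallback: str = "eng") -> str:
    """Hypothetical helper: detect the script first, then OCR with its model."""
    # Step 1: orientation and script detection only (page segmentation mode 0).
    osd = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0", "-l", "osd"],
        capture_output=True, text=True, check=False,
    ).stdout
    # Step 2: full OCR with the detected script model (or the fallback).
    lang = choose_lang(osd, fallback)
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Script detection can fail on pages with very little text, which is why the sketch falls back to a fixed language instead of raising an error. A real integration would need to decide what that fallback should be.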

dolfim-ibm (Maintainer) · 2024-12-20T16:28:38Z

This is a good idea. I think we might have to review the design a bit for when to use the "auto" language mode, but it would definitely be a nice contribution.
