poor performance compared to raw tesseract #675

Open
imalone opened this issue Jun 10, 2024 · 1 comment

imalone commented Jun 10, 2024

I've been trying out gimagereader recently and have been struggling with it. I assumed the problem was tesseract's OCR, but running tesseract directly produces much better results. Here's the start of a sample scanned from a newspaper article, run with no options, just "tesseract 20240610_094500.jpg 20240610_094500-1":
======
| News

Dalya Alberge

It is a founding document of the
. US and inspired the Declaration ~
of Independence and the purge of

English power from the colonies.
‘But, ironically, George Mason’s
======
[...continues...]

And the start of the same sample scanned in gimagereader (with automatic page segmentation option for tesseract, recognise all, no layout detection or image adjustments):
======
Fi Spe 3 ai ; R Nadia! Os a
pt EAS ar Mi eben ied

ERE a7 CIARA TIGA — Dats diay

Narayan,

Snes 5) 4
i 70
ACN ay aaa

LEN cise 7 i

Dalya! Se Loreey or — in | Washington, |

ie i rE 2a clearer ,al
======
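
For anyone trying to reproduce the comparison, one thing that can be checked from the command line is whether the page segmentation mode alone accounts for the difference. The sketch below is only an illustration: `psm_sweep` and the `TESSERACT` override are hypothetical names, and it is an assumption (not confirmed here) that gimagereader's "automatic page segmentation" corresponds to tesseract's `--psm 3`.

```shell
#!/bin/sh
# Hypothetical helper: run command-line tesseract on one image under several
# page segmentation modes so the resulting text files can be compared.
# TESSERACT can be overridden, e.g. TESSERACT=echo for a dry run.
TESSERACT="${TESSERACT:-tesseract}"

psm_sweep() {
    image="$1"
    base="${image%.*}"   # strip the extension to build the output base name
    # 3 = fully automatic page segmentation, no OSD (tesseract's default)
    # 4 = single column of text, 6 = single uniform block, 11 = sparse text
    for psm in 3 4 6 11; do
        "$TESSERACT" "$image" "${base}-psm${psm}" --psm "$psm" || return 1
    done
}
```

Calling `psm_sweep 20240610_094500.jpg` would write `20240610_094500-psm3.txt`, `20240610_094500-psm4.txt` and so on. If none of the modes reproduces the garbled text, the segmentation mode is probably not the cause.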

This is on Fedora 41 (beta),
gimagereader-gtk-3.4.2-1.fc40.x86_64

I can see it links tesseract:
$ ldd /usr/bin/gimagereader-gtk|grep tesseract
libtesseract.so.5.3.4 => /lib64/libtesseract.so.5.3.4 (0x00007f002bc00000)

And this is the same as my command line tesseract:
$ rpm -qf /lib64/libtesseract.so.5.3.4
tesseract-5.3.4-4.fc40.x86_64
$ rpm -qf /bin/tesseract
tesseract-5.3.4-4.fc40.x86_64

The file is a JPEG photo taken on a phone. I've tried loading it in GIMP, allowing conversion of the embedded colour profile, and exporting as JPEG, TIFF (LZW) and PNG. This changes the output slightly for both direct tesseract and gimagereader (the PNG and TIFF results are identical), but the overall result is the same: tesseract extracts reasonable text, while gimagereader produces mainly nonsense with a few patches of coherence.
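
The format comparison described above can be made repeatable with a small check on the text files that command-line tesseract writes. This is a minimal sketch: `same_ocr` is a hypothetical helper name, and the file names in the usage note are placeholders.

```shell
# Hypothetical helper: report whether the OCR text from two runs is
# byte-for-byte identical.
same_ocr() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example, after `tesseract sample.png out-png` and `tesseract sample.tiff out-tiff`, running `same_ocr out-png.txt out-tiff.txt` shows whether the two exports gave the same text.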

It would be nice to be able to use gimagereader, since the layout detection would be handy (I've tried layout detection, removing any spurious selections, and it produces similarly nonsensical output). Any ideas what might be going wrong here?


imalone commented Jun 10, 2024

20240610_094500
