poor performance compared to raw tesseract #675

imalone · 2024-06-10T11:44:55Z

I've been trying out gimagereader recently and was struggling with it. I thought the problem was tesseract's OCR, but running tesseract directly produces much better results. Here's the start of a sample scanned from a newspaper article, no options, just "tesseract 20240610_094500.jpg 20240610_094500-1":
======
| News

Dalya Alberge

It is a founding document of the
. US and inspired the Declaration ~
of Independence and the purge of

English power from the colonies.
‘But, ironically, George Mason’s
======
[...continues...]

And the start of the same sample scanned in gimagereader (with automatic page segmentation option for tesseract, recognise all, no layout detection or image adjustments):
======
Fi Spe 3 ai ; R Nadia! Os a
pt EAS ar Mi eben ied

ERE a7 CIARA TIGA — Dats diay

Narayan,

Snes 5) 4
i 70
ACN ay aaa

LEN cise 7 i

Dalya! Se Loreey or — in | Washington, |

ie i rE 2a clearer ,al
======

This is on Fedora 41 (beta),
gimagereader-gtk-3.4.2-1.fc40.x86_64

I can see it links tesseract:
$ ldd /usr/bin/gimagereader-gtk|grep tesseract
libtesseract.so.5.3.4 => /lib64/libtesseract.so.5.3.4 (0x00007f002bc00000)

And this is the same as my command line tesseract:
$ rpm -qf /lib64/libtesseract.so.5.3.4
tesseract-5.3.4-4.fc40.x86_64
$ rpm -qf /bin/tesseract
tesseract-5.3.4-4.fc40.x86_64

The file is a jpeg picture taken on a phone. I've tried loading it in Gimp, allowing conversion of the embedded colour profile, and exporting as jpeg, tiff (lzw) and png. This changes the outputs slightly for both direct tesseract and gimagereader (png and tiff are identical), but the overall picture remains the same: tesseract extracts a reasonable scan while gimagereader produces mainly nonsense with a few patches of coherence.
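One way to narrow this down might be to run command-line tesseract with the page segmentation mode set explicitly, so the comparison does not depend on defaults. The sketch below is only an illustration: the helper name `ocr_all_psm` and the output naming are made up here, and it is an assumption (not confirmed anywhere in this report) that gimagereader's "automatic page segmentation" option corresponds to tesseract's `--psm 3`. Modes 4 and 6 are included only as contrasting cases.

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on one image with several explicit
# page segmentation modes so the outputs can be compared side by side.
# TESSERACT can be overridden to point at a different binary.
TESSERACT="${TESSERACT:-tesseract}"

ocr_all_psm() {
    img="$1"
    base="${img%.*}"    # strip the file extension for the output base name
    # 3 = fully automatic page segmentation (tesseract's default),
    # 4 = single column of text, 6 = single uniform block of text
    for psm in 3 4 6; do
        "$TESSERACT" "$img" "${base}-psm${psm}" --psm "$psm"
    done
}

# Example (writes 20240610_094500-psm3.txt and so on):
#   ocr_all_psm 20240610_094500.jpg
```

If the `--psm 3` output matches the plain command-line run, the segmentation mode is probably not what differs between the two.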

It would be nice to be able to use gimagereader, since the layout detection would be handy (I've tried layout detection, including removing any spurious selections, and the output is similarly nonsensical). Any ideas what might be going wrong here?

imalone · 2024-06-10T11:48:04Z
