bug(pdf): incorrect selected area #20

Kristinita · 2019-08-15T16:36:40Z

1. Possibly related issue

#1712.

2. Summary

PDF viewers select words incorrectly in PDFs created by Tesseract.

3. Data

Example files from my book:

KiraProcessedTIF.tif - TIF image
KiraSuperhero.pdf - PDF created by Tesseract
KiraCorrectOCR.pdf - PDF with correct OCR, for comparison

4. Steps to reproduce

I downloaded the 64-bit Windows version from here, as described in the official Tesseract wiki → during installation I selected Russian (rus) as an additional language → I installed Tesseract → I added the directory containing tesseract.exe to my user PATH environment variable → I ran the command:

tesseract KiraProcessedTIF.tif KiraSuperhero -l rus pdf
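
The steps above can also be scripted. This is a minimal sketch, assuming Python 3 is installed and tesseract is on PATH; the helper names `build_tesseract_command` and `run_reproduction` are hypothetical and not part of Tesseract. It additionally requests hOCR output next to the PDF, so the word bounding boxes that Tesseract recognized can be compared with what a PDF viewer highlights.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang="rus", outputs=("pdf", "hocr")):
    """Return the argument list for one Tesseract run.

    `outputs` are Tesseract config names; "pdf" writes <output_base>.pdf and
    "hocr" writes <output_base>.hocr with per-word bounding boxes.
    """
    return ["tesseract", image, output_base, "-l", lang, *outputs]


def run_reproduction(image="KiraProcessedTIF.tif", output_base="KiraSuperhero"):
    """Run Tesseract if it is on PATH; return the command that was (or would be) run."""
    command = build_tesseract_command(image, output_base)
    if shutil.which("tesseract") is None:
        # Assumption: a missing binary should be reported, not raised.
        print("tesseract not found on PATH; command would be:", " ".join(command))
        return command
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return command
```

Calling `run_reproduction()` in the directory that contains KiraProcessedTIF.tif should produce KiraSuperhero.pdf and KiraSuperhero.hocr; passing `outputs=("pdf",)` to `build_tesseract_command` gives exactly the command shown above.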

5. Expected behavior

In KiraCorrectOCR, text is selected correctly in any program.

6. Actual behavior

In KiraSuperhero, the selection does not cover the full word.

This reproduces for any word in KiraSuperhero.

7. Not helped

Switching PDF viewers did not help: the actual behavior reproduces for KiraSuperhero in every viewer I tried, including Firefox and PDF-XChange Editor.

8. Environment

Windows 10 Enterprise LTSB 64-bit EN

D:\SashaDebugging\KiraGoddess>tesseract --version
tesseract v5.0.0-alpha.20190708
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

Thanks.


stweil · 2019-08-15T18:23:58Z

Yes, the positions are not exact. That also happens with the old OCR engine (--oem 0). It is not Windows specific.

stweil · 2019-08-15T18:27:16Z

I am afraid there is no fast solution. It looks like this problem is already rather old.

Kristinita · 2019-08-16T08:23:51Z

@stweil, Type: Question ❓

If this issue is a «duplicate», could you point me to the issue it duplicates?

I want to subscribe to that issue to follow progress on a fix.

Thanks.

stweil · 2019-08-16T08:29:33Z

It is a duplicate of tesseract-ocr#1712.

stweil added the duplicate label Aug 15, 2019

Kristinita mentioned this issue Aug 19, 2019

bug(non-ascii): Cyrillic symbols in generated PDF Kristinita/SashaMiscellaneous#27

Open
