bug(pdf): incorrect selected area #20

Open
Kristinita opened this issue Aug 15, 2019 · 4 comments

Comments

@Kristinita

1. Possibly related issue

#1712.

2. Summary

PDF viewers select words incorrectly in PDFs created by Tesseract.

3. Data

Example files from my book:

4. Steps to reproduce

I downloaded the 64-bit Windows version from here, as described in the official Tesseract wiki → during installation I selected Russian (rus) as an additional language → I installed Tesseract → I added the directory containing tesseract.exe to my user PATH environment variable → I ran the command:

tesseract KiraProcessedTIF.tif KiraSuperhero -l rus pdf
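
For anyone who wants to script the reproduction, here is a minimal Python sketch of the same command. The helper names `build_tesseract_pdf_command` and `run_tesseract_pdf` are hypothetical and not part of Tesseract; the sketch only assumes that `tesseract` is on PATH and that the `rus` language data is installed.

```python
import shutil
import subprocess


def build_tesseract_pdf_command(image, output_base, lang="rus"):
    """Return the argv list for: tesseract <image> <output_base> -l <lang> pdf"""
    return ["tesseract", image, output_base, "-l", lang, "pdf"]


def run_tesseract_pdf(image, output_base, lang="rus"):
    """Run Tesseract and return the name of the PDF it is expected to write."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract was not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_pdf_command(image, output_base, lang), check=True)
    # Tesseract appends the .pdf extension to the output base name.
    return output_base + ".pdf"
```

Calling `run_tesseract_pdf("KiraProcessedTIF.tif", "KiraSuperhero")` should be equivalent to the command above.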

5. Expected behavior

In KiraCorrectOCR, text is selected correctly in any program:

KiraCorrectOCR Марк

KiraCorrectOCR самоосмысление

6. Actual behavior

In KiraSuperhero, the PDF generated by Tesseract, the selection does not cover the full word:

KiraSuperhero Марк

KiraSuperhero самоосмысление

This reproduces for any word in KiraSuperhero.

7. Not helped

I can reproduce the actual behavior for KiraSuperhero in any PDF viewer.

  • Firefox:

Firefox

  • PDF-XChange Editor:

PDF-XChange

8. Environment

  • Windows 10 Enterprise LTSB 64-bit EN
D:\SashaDebugging\KiraGoddess>tesseract --version
tesseract v5.0.0-alpha.20190708
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

Thanks.

@stweil
Member

stweil commented Aug 15, 2019

Yes, the positions are not exact. That also happens with the old OCR engine (--oem 0). It is not Windows specific.
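
One way to check how far off the positions are is to dump the word boxes Tesseract itself reports and compare them with what the PDF viewer highlights. The sketch below parses the output of the `tsv` config (for example `tesseract KiraProcessedTIF.tif KiraBoxes -l rus tsv`, where `KiraBoxes` is just an illustrative output name). The function `parse_word_boxes` and the sample values in `SAMPLE_TSV` are made up for illustration and are not measurements from the files in this issue. The boxes are in image pixels, and whether they match the PDF text layer is exactly what would need checking.

```python
import csv
import io

# Hypothetical sample in the column layout of Tesseract's TSV output.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t100\t50\t80\t30\t96\tМарк\n"
)


def parse_word_boxes(tsv_text):
    """Return (text, left, top, right, bottom) for each word-level row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        # Level 5 rows describe individual words; lower levels are
        # pages, blocks, paragraphs and lines.
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        if not text:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        boxes.append((text, left, top, left + width, top + height))
    return boxes
```

With a real file, reading the generated `.tsv` as UTF-8 and passing its contents to `parse_word_boxes` gives one tuple per recognized word, which can then be compared against the selection rectangles shown in the screenshots.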

@stweil
Member

stweil commented Aug 15, 2019

I am afraid there is no quick fix. It looks like this problem is already rather old.

@Kristinita
Author

@stweil, Type: Question ❓

If this issue is a duplicate, can you point me to the issue it duplicates?

I want to subscribe to that issue to follow progress on a fix.

Thanks.

@stweil
Member

stweil commented Aug 16, 2019

It is a duplicate of tesseract-ocr#1712.
