Spaces in between words #33
Comments
Sure, see the email in my profile. Are you using the stable version from Hackage?
Thx, I just sent you an email.
Yes I'm using |
Indeed, the magic constant has nothing to do with this. The file actually contains these spaces, but the next glyph starts at the beginning of the space, overriding it. I'm sending details by email. I don't know whether it makes sense to add such heuristics to |
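To illustrate the kind of heuristic being discussed, here is a minimal sketch in Python. It is not code from this library: the `Glyph` record, the `drop_overridden_spaces` function, and the `tolerance` parameter are all hypothetical names, and it assumes glyphs arrive in reading order with a known left edge. It drops a space glyph whenever the next glyph starts at (almost) the same position, since that next glyph visually overrides the space.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical positioned glyph (not a type from the library)."""
    text: str
    x: float      # left edge of the glyph on the line
    width: float  # advance width (unused here, kept for illustration)


def drop_overridden_spaces(glyphs: List[Glyph], tolerance: float = 0.5) -> str:
    """Join glyph texts, skipping spaces that the following glyph overprints.

    A space counts as overridden when the next glyph's left edge lies within
    `tolerance` of the space's own left edge. The tolerance is an assumed
    value and would need tuning against real documents.
    """
    out = []
    for i, glyph in enumerate(glyphs):
        nxt = glyphs[i + 1] if i + 1 < len(glyphs) else None
        overridden = (
            glyph.text == " "
            and nxt is not None
            and abs(nxt.x - glyph.x) <= tolerance
        )
        if not overridden:
            out.append(glyph.text)
    return "".join(out)


glyphs = [
    Glyph("B", 0.0, 6.0),
    Glyph("r", 6.0, 4.0),
    Glyph("i", 10.0, 3.0),
    Glyph(" ", 13.0, 3.0),  # space present in the file ...
    Glyph("g", 13.0, 5.0),  # ... but the next glyph starts at the same x
    Glyph("a", 18.0, 5.0),
    Glyph(" ", 23.0, 3.0),  # genuine space: next glyph starts after it
    Glyph("d", 26.0, 5.0),
]
print(drop_overridden_spaces(glyphs))
```

The obvious risk is that a fixed tolerance misfires on documents with tight kerning, which is one reason to be cautious about building this into a general-purpose extractor.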
Thank you for the investigation. It is reasonable to be skeptical about a heuristic for such cases. I have worked with around 100-200 documents, and this specific document is the first with this non-ideal glyph pattern. Currently, my time budget is far too overstretched to commit a PR, but we are allocating new resources, so maybe in the near future. For this document I will wrap pdftotext, but mainly I will stick with this library. Thanks for developing it!
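For reference, wrapping pdftotext for a single problematic document could look roughly like the sketch below. The function names `pdftotext_command` and `extract_text` are made up for illustration, and it assumes the `pdftotext` binary is on the `PATH`. Passing `-` as the output file sends the text to stdout, and `-enc UTF-8` pins the output encoding.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build the command line; '-' tells pdftotext to write to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

A caller would then use `extract_text("some.pdf")` only for the documents that show the broken-word pattern and keep the library for everything else.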
Hi,
Can I send you a PDF file via email that shows the problem above? I have a problem when extracting text from a PDF file: on some pages, words are broken up by spaces. For example:
"Der Bri ga di er be ob ach te te das Spek ta kel grim mig wie ein"
instead of:
"Der Brigadier beobachtete das Spektakel grimmig wie ein"
No problems with pdftotext version 3.03. I'm not allowed to upload the PDF publicly.
Greetings
Jan