Export issues #599

Open
HunterZ opened this issue Aug 14, 2022 · 4 comments

Comments

@HunterZ

HunterZ commented Aug 14, 2022

Running into a number of issues trying to export the results of painstakingly fine-tuning the hOCR for a PDF.

First, attempting to export directly to PDF from gImageReader-gtk 3.3.1 under Debian, or from the latest gImageReader-qt CI build under Windows with the PoDoFo backend, results in the following error:
[image: screenshot of the error]

I suspect this is because I am using custom OTF fonts that are installed in each OS.

Second, attempting to export from gImageReader-qt latest CI under Windows with the QPrinter backend results in the text getting chopped up and duplicated in weird ways. Compare the gImageReader hOCR tree for my first page with the object list from the exported PDF:
[image: gImageReader hOCR tree for the first page]
[image: object list from the exported PDF]

Third, exporting to ODT from gImageReader-gtk 3.3.1 under Debian (not tested under Windows) results in a couple of issues:

  • Text gets line wrapped if the OCR text doesn't fit perfectly within the defined bounding boxes
  • Individual line alignment gets lost when multiple lines are grouped under a paragraph in the hOCR tree
  • Edit: Everything also seems shifted down (even with a baseline of 0 0), although I can't prove whether this is gImageReader's or LibreOffice's fault:
    image
    image

As things currently stand, I don't see any way to get a viable PDF out of gImageReader, even indirectly via ODT->PDF, because every export method fails outright, produces garbled output, or discards aspects of my painstakingly hand-aligned custom font text.

@MicahBird

I'm also experiencing this issue on Fedora, but this line in your issue is key:

"I suspect this is because I am using custom OTF fonts that are installed in each OS."

Unfortunately, exporting with custom fonts seems finicky: whenever I try to export with the Sans font family, it gives the same error, The PDF export failed: ePdfError_UnsupportedFontFormat.

However, when exporting with Arial or any font in the Liberation font family, it works! Hope this helps :)

@HunterZ
Author

HunterZ commented Oct 29, 2022

I have some more information to share:

First, I tried converting all OTF fonts I'm using to TTF (via FontForge, then a Python otf2ttf script) and replacing them in my OS. Unfortunately this didn't fix it, but I was able to narrow things down to two font families.

On a hunch, I used sed to change one of the font names in XML from that of the font family to that of one of the specific weight variants (medium/semibold/bold) - and it worked.
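For anyone who wants to repeat the font-name swap without sed, here is a minimal Python sketch. It assumes the font is stored as an `x_font <name>` entry inside the hOCR `title` attributes; that layout, the function name `swap_font_name`, and the sample names `XYZ` / `XYZ Semibold` are illustrative assumptions, not taken from the original post.

```python
import re


def swap_font_name(hocr_text: str, family: str, variant: str) -> str:
    """Replace `x_font <family>` with `x_font <variant>` in hOCR text.

    Assumption: the font name is followed by ';' or the closing quote of
    the title attribute, so entries already naming a variant are skipped.
    """
    pattern = re.compile(r"(x_font\s+)" + re.escape(family) + r"(?=\s*[;\"'])")
    # A function replacement avoids backslash/group surprises in `variant`.
    return pattern.sub(lambda m: m.group(1) + variant, hocr_text)


# Hypothetical hOCR fragment for illustration only.
sample = (
    '<span class="ocrx_word" '
    'title="bbox 10 10 90 30; x_font XYZ; x_fsize 12">Hello</span>'
)
print(swap_font_name(sample, "XYZ", "XYZ Semibold"))
```

Running the function a second time leaves the text unchanged, because the lookahead only matches the bare family name.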

The problem with this workaround is that gImageReader only lets you pick a font family from its GUI, and not a weight variant. Both of these font families have 6 variants: medium/semibold/bold weights, each with regular and italic slant variants.

gImageReader was able to work out the italic variant when I picked a specific weight via XML, but this means that I'll probably have to specify the bold weight via XML hacking whenever I want bold, or the regular weight when I want non-bold.

...or maybe I can use FontForge to rearrange the font family naming to a taxonomy that is hopefully better supported by gImageReader?

@HunterZ
Author

HunterZ commented Oct 29, 2022

Another update:

I was able to solve it by using FontForge to rename the medium variants' PostScript Names as follows:

  • XYZ-Medium => XYZ
  • XYZ-MediumItalic => XYZ-Italic

Once I did this, exported, and reinstalled the fonts, gImageReader was able to use the family name to derive regular, italic, bold, and bold+italic variants via its own flags.
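The rename described above was done in the FontForge GUI; the sketch below shows how it might be scripted instead. `rebase_ps_name` and `rename_font_file` are hypothetical helper names, and the file-level step assumes FontForge's Python module (`fontforge`) is installed, where `font.fontname` holds the PostScript name.

```python
def rebase_ps_name(ps_name: str, base_weight: str = "Medium") -> str:
    """Drop the base-weight token from a PostScript name.

    XYZ-Medium       -> XYZ
    XYZ-MediumItalic -> XYZ-Italic
    Names whose style does not start with `base_weight` are returned as-is.
    """
    family, sep, style = ps_name.partition("-")
    if not sep or not style.startswith(base_weight):
        return ps_name
    rest = style[len(base_weight):]
    return f"{family}-{rest}" if rest else family


def rename_font_file(src: str, dst: str, base_weight: str = "Medium") -> None:
    """Open a font, rewrite its PostScript name, and write it to `dst`."""
    # Imported lazily: the module only exists where FontForge is installed.
    import fontforge

    font = fontforge.open(src)
    font.fontname = rebase_ps_name(font.fontname, base_weight)
    font.generate(dst)
    font.close()


for name in ("XYZ-Medium", "XYZ-MediumItalic", "XYZ-Bold"):
    print(name, "=>", rebase_ps_name(name))
```

Only the two medium variants change; the bold and semibold names pass through untouched, which matches the two renames listed above.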

The takeaway here is that gImageReader apparently only supports fonts that have a variant whose PS Name has no dashed suffix, which it then uses to derive the corresponding -Italic, -Bold, and -BoldItalic variant names. A font whose "base" variant is -Medium and base italic variant is -MediumItalic just doesn't work.
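To check a font family against this naming pattern before exporting, something like the following could help. It only models the behaviour inferred above (family name plus bold/italic flags mapped to a PostScript name); it is not gImageReader's or PoDoFo's actual lookup code, and both function names are made up for the example.

```python
def derived_ps_name(family: str, bold: bool = False, italic: bool = False) -> str:
    """Build the PostScript name the exporter appears to look for."""
    suffix = ("Bold" if bold else "") + ("Italic" if italic else "")
    return f"{family}-{suffix}" if suffix else family


def missing_variants(family: str, installed: set) -> list:
    """List the derived names that are absent from `installed`."""
    wanted = [
        derived_ps_name(family, bold, italic)
        for bold in (False, True)
        for italic in (False, True)
    ]
    return [name for name in wanted if name not in installed]


# Hypothetical PostScript names of a family whose base weight is "Medium".
installed = {"XYZ-Medium", "XYZ-MediumItalic", "XYZ-Bold", "XYZ-BoldItalic"}
print(missing_variants("XYZ", installed))
```

For the example set, the regular and italic names come back as missing, which is the situation the rename above works around.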

@manisandro
Owner

I suspect this is a limitation in PoDoFo.
