
hOCR for pdftohtml? #1

Open
bmix opened this issue Dec 4, 2016 · 0 comments

bmix commented Dec 4, 2016

Hi!

This is not a bug or wish, just a thought I had an hour ago.

I know this is not the main Poppler repository, but this repo was referenced in Poppler's Bugzilla in an issue about the mismatched <b> and <i> elements (i.e. Xerces validator: Unexpected element "b". The content of the parent element type must match "(#PCDATA)".). Since you work on pdftohtml, and I already have an account here and am too lazy to sign up there (or on the mailing list) ;-), I am raising this here. I hope you don't mind.

The XML created with the -xml switch could, it seems, be fully replaced by the hOCR (Wikipedia) microformat, which is gaining more and more support (Tesseract-OCR supports it out of the box). It is also XML, since it is based on XHTML. The specification for hOCR-1.2 is here.

I think it may be a little less elegant, since it uses the title attribute to store absolute positioning, font, and other information, but adopting it would mean one less XML format to maintain (even if it is a very simple one), and people who have already written XSL transforms could reuse their stylesheets. Maybe hOCR could even replace the main HTML output that pdftohtml creates? The CSS should be easy to generate in an XSLT, which could even be included inline in the output.
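To make the idea a bit more concrete, here is a rough sketch of how the -xml output might map onto hOCR. It is only an illustration under assumptions: the sample input imitates the page/fontspec/text layout that pdftohtml -xml emits, the function names (bbox, to_hocr) are made up for this example, and the use of ocr_page, ocr_line, bbox, x_font and x_fsize is my reading of the hOCR spec rather than anything pdftohtml does today. Inline <b> and <i> markup is simply flattened to text here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating the layout of `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="50" width="200" height="15" font="0">Hello <b>world</b></text>
</page>
</pdf2xml>"""


def bbox(el):
    """Turn left/top/width/height attributes into an hOCR 'bbox x0 y0 x1 y1' string."""
    left, top = int(el.get("left", 0)), int(el.get("top", 0))
    width, height = int(el.get("width", 0)), int(el.get("height", 0))
    return f"bbox {left} {top} {left + width} {top + height}"


def to_hocr(xml_text):
    """Convert pdftohtml-style XML into an hOCR-like <body> fragment (illustrative only)."""
    root = ET.fromstring(xml_text)
    body = ET.Element("body")
    fonts = {}  # fontspec id -> element, accumulated across pages
    for page in root.iter("page"):
        for spec in page.iter("fontspec"):
            fonts[spec.get("id")] = spec
        page_div = ET.SubElement(body, "div", {"class": "ocr_page", "title": bbox(page)})
        for text in page.iter("text"):
            props = [bbox(text)]
            spec = fonts.get(text.get("font"))
            if spec is not None:
                # Assumption: font family and size go into x_font / x_fsize.
                props.append(f"x_font {spec.get('family')}")
                props.append(f"x_fsize {spec.get('size')}")
            line = ET.SubElement(
                page_div, "span", {"class": "ocr_line", "title": "; ".join(props)}
            )
            # Flatten nested <b>/<i> children into plain text.
            line.text = "".join(text.itertext())
    return ET.tostring(body, encoding="unicode")


if __name__ == "__main__":
    print(to_hocr(SAMPLE))
```

All positioning and font data ends up in the title attribute of each element, which is the part I find a little less elegant, but the result is ordinary XHTML that existing hOCR tooling could in principle consume.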
