hOCR for pdftohtml? #1

bmix · 2016-12-04T18:22:13Z

Hi!

This is not a bug report or a wish, just a thought I had an hour ago.

I know this is not the main Poppler repository, but this repo was referenced in an issue in Poppler's Bugzilla about the mismatch of the <b> and <i> elements (i.e. the Xerces validator reports: Unexpected element "b". The content of the parent element type must match "(#PCDATA)".). Since you work on pdftohtml, and I already have an account here and am too lazy to sign up there (or on the mailing list) ;-), I am raising this here. I hope you don't mind.

The XML that gets created with the -xml switch could, it seems, be fully replaced by the hOCR microformat (see Wikipedia), which is gaining more and more support (Tesseract-OCR supports it out of the box). It is also XML, since it is based on XHTML. The specification I am referring to is hOCR 1.2.

I think it may be a little less elegant, since hOCR uses the title attribute to store absolute positioning, font, and other info. But using it would mean one XML format less to maintain (even if it is a very simple one), and people who have already written XSL transforms for hOCR could re-use their stylesheets. Maybe hOCR could even replace the main HTML output that pdftohtml creates? The CSS should be easy to generate in an XSLT, which might even be included inline in the output.
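To make the idea a bit more concrete, here is a rough Python sketch of how one page of the -xml output might be mapped to hOCR. It assumes the usual `<page>` and `<text top left width height>` elements from pdftohtml and uses the hOCR convention of `title="bbox x0 y0 x1 y1"`. The function names `text_to_hocr_span` and `page_to_hocr` are made up for this example, and treating every `<text>` element as an `ocr_line` is my assumption, not something either format prescribes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_to_hocr_span(text_el):
    """Map one pdftohtml <text> element to an hOCR ocr_line span (hypothetical helper)."""
    left = int(text_el.get("left"))
    top = int(text_el.get("top"))
    # hOCR wants the two corners of the box, not width and height.
    right = left + int(text_el.get("width"))
    bottom = top + int(text_el.get("height"))
    # itertext() flattens nested <b>/<i>, so inline styling is dropped in this sketch.
    content = escape("".join(text_el.itertext()))
    return (
        f'<span class="ocr_line" title="bbox {left} {top} {right} {bottom}">'
        f"{content}</span>"
    )


def page_to_hocr(page_el):
    """Wrap all <text> children of a <page> in an hOCR ocr_page div (hypothetical helper)."""
    width = int(page_el.get("width"))
    height = int(page_el.get("height"))
    lines = ["  " + text_to_hocr_span(t) for t in page_el.iter("text")]
    body = "\n".join(lines)
    return f'<div class="ocr_page" title="bbox 0 0 {width} {height}">\n{body}\n</div>'


# Hand-written sample in the shape of pdftohtml -xml output (not real tool output).
sample = (
    '<page number="1" position="absolute" top="0" left="0" height="1188" width="918">'
    '<text top="100" left="50" width="200" height="20" font="0">Hello <b>world</b></text>'
    "</page>"
)
print(page_to_hocr(ET.fromstring(sample)))
```

A real converter would also have to carry over the font information and keep the <b> and <i> markup, which this sketch deliberately ignores.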
