Hi!
This is not a bug report or a wish, just a thought I had an hour ago.
I know this is not the main Poppler repository, but this repo was referenced in Poppler's Bugzilla in an issue about the mismatch of the <b> and <i> elements (i.e. Xerces validator: Unexpected element "b". The content of the parent element type must match "(#PCDATA)".). Since you work on pdftohtml, and I already have an account here and am too lazy to sign up there (or on the mailing list) ;-), I am raising this here. I hope you don't mind.
The XML created with the -xml switch could, it seems, be fully replaced by the hOCR (Wikipedia) microformat, which is gaining more and more support (Tesseract-OCR supports it out of the box). It is also XML, since it uses XHTML. The specification for hOCR-1.2 is here.
I think it may be a little less elegant, since it uses the title attribute to store absolute positioning, font, and other information, but using it would mean one less XML format to maintain (even if it is a very simple one), and people who have already written XSL transforms for hOCR could reuse their stylesheets. Maybe hOCR could even replace the main HTML output that pdftohtml creates? The CSS should be easy to generate with an XSLT, which might even be included inline in the output.
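To make the idea a bit more concrete, here is a rough Python sketch of how the -xml output could be mapped onto hOCR. It is only an illustration under assumptions: I assume the -xml output has page elements with width and height attributes and text elements with integer top, left, width and height attributes; the function name pdf2xml_to_hocr and the SAMPLE input are made up for this example, and font information (fontspec) is ignored entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like what I assume `pdftohtml -xml` emits.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="200" height="15" font="0">Hello <b>world</b></text>
</page>
</pdf2xml>"""


def pdf2xml_to_hocr(xml_text: str) -> str:
    """Map pdftohtml-style XML onto a minimal hOCR-like XHTML document."""
    src = ET.fromstring(xml_text)

    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    head = ET.SubElement(html, "head")
    # hOCR describes the producing system and its capabilities in meta tags.
    ET.SubElement(head, "meta", {"name": "ocr-system", "content": "pdftohtml (sketch)"})
    ET.SubElement(head, "meta", {"name": "ocr-capabilities", "content": "ocr_page ocr_line"})
    body = ET.SubElement(html, "body")

    for page in src.iter("page"):
        width, height = int(page.get("width")), int(page.get("height"))
        page_div = ET.SubElement(
            body, "div", {"class": "ocr_page", "title": f"bbox 0 0 {width} {height}"}
        )
        for text in page.iter("text"):
            # pdftohtml gives top/left/width/height; hOCR wants x0 y0 x1 y1.
            x0, y0 = int(text.get("left")), int(text.get("top"))
            x1 = x0 + int(text.get("width"))
            y1 = y0 + int(text.get("height"))
            line = ET.SubElement(
                page_div, "span", {"class": "ocr_line", "title": f"bbox {x0} {y0} {x1} {y1}"}
            )
            # Carry over the text and any inline <b>/<i> children unchanged.
            line.text = text.text
            line.extend(list(text))

    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(pdf2xml_to_hocr(SAMPLE))
```

Since span allows inline markup, the <b> and <i> children would no longer clash with a (#PCDATA)-only content model. Whether each text element should become an ocr_line or some other hOCR class is a design question I have not thought through.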