GCV to HOCR or PAGE conversion not working #121

OmriPi · 2020-01-27T14:57:12Z

Hi all,

I am new to using this software so please bear with me if this has been asked before or I'm not using the tool correctly.

I have the JSON output of Google Vision OCR for a PDF (emphasis on PDF, not an image).
I would like to create a searchable version of that PDF using the OCR results. I tried gcv2hocr, but either it doesn't work on PDFs or it hits some other error, because the hOCR output I get from it is basically just the metadata. I tried ocr-fileformat on the same file and again got only the metadata. Converting to PAGE fails as well, and the output is a series of Java exception lines. Does ocr-fileformat support GCV JSON generated from a PDF?

The file I'm trying to run it on is the sample file from google:
gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf

And the JSON is generated following this tutorial:
https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python

If you could assist me or point me in the direction of how to solve it I would be very grateful, as I'm in an urgent need to solve this issue.

Thanks in advance!

kba · 2020-01-28T12:13:32Z

To convert PDF to Google Cloud Vision JSON, you need to use Google Cloud Vision, which is a commercial cloud service we neither support nor endorse. Once you have that JSON data from their service, you can convert it to hOCR.

kba · 2020-01-28T12:16:13Z

You could also convert to PAGE via hOCR and try https://github.com/PRImA-Research-Lab/prima-page-to-pdf

OmriPi · 2020-01-30T09:17:12Z

Hi @kba , thank you for the answer. I think I may have not explained it correctly, or you misunderstood me:
I have used google vision to get the JSON, I already have it. I am having a problem with using the gcv to HOCR transformer found in this package. When I use it on the JSON I got from google vision, I am getting an almost blank output, with only the metadata.

When I'm trying to convert it to PAGE instead I get this result:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:994)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:169)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:204)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
Exception in thread "main" java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:389)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:216)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

So I'm trying to understand why the gcv converters in this module are not working for me, even though I have a perfectly valid gcv JSON. I can send you the JSON generated by gcv so you can try converting it yourself, if that helps.

Thanks!

OmriPi · 2020-01-30T09:20:14Z

extracted_pdf.pdfoutput-1-to-1.txt

This is the JSON from gcv that I'm using (I changed the suffix to .txt so I could upload it here). It's the JSON for the sample document that Google uses in the tutorial.
Can you try it and see whether the conversion works correctly for you?
Thanks!

kba · 2020-01-30T09:55:42Z

Then it's best to ask @dinosauria123 (not sure whether they're subscribed to issues here but they should see the mention). The code is at https://github.com/dinosauria123/gcv2hocr

dinosauria123 · 2020-01-31T00:05:55Z

Hi,
If you have a problem, please open an issue at https://github.com/dinosauria123/gcv2hocr.

OmriPi · 2020-02-06T11:15:34Z

Ok @dinosauria123 ! Thanks

sarepal · 2021-03-02T19:37:04Z

Is this issue still live? I'm getting a similar error (org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.) when I try to convert GCV to PAGE. I'm attaching a zip file with the JPG and two versions of the GCV output: gcv-google-api (made with a Python script I wrote to call the Google API) and gcv-sh (produced with the shell script provided by @dinosauria123 at https://github.com/dinosauria123/gcv2hocr). Thanks for your consideration.
gcv-sample.zip
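
A SAXParseException reported at line 1, column 1 generally means the XML parser was handed something that does not begin with XML markup, for example JSON or an empty file. The following is a minimal Python sketch for checking what a file actually starts with before passing it to a converter; the function name `first_content_kind` is hypothetical and not part of ocr-fileformat or gcv2hocr.

```python
from pathlib import Path


def first_content_kind(path):
    """Classify a file by its first non-whitespace byte.

    Returns "xml", "json", "empty", or "other". This is only a rough
    heuristic for spotting input that an XML parser will reject at once.
    """
    head = Path(path).read_bytes()[:256].lstrip()
    if not head:
        return "empty"
    if head.startswith(b"<"):
        return "xml"
    if head.startswith((b"{", b"[")):
        return "json"
    return "other"
```

If the file given to the PAGE converter is classified as "json" or "empty", the prolog error is expected, and the problem lies in an earlier step rather than in the PAGE reader.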

jcuenod · 2021-11-08T20:20:41Z

@sarepal I'm still having issues converting GCV to hOCR and, though I could be wrong, I think the conversion to PAGE goes via hOCR. Are you using a result from TEXT_DETECTION or DOCUMENT_TEXT_DETECTION?
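
To answer that question for a given file, it can help to look at which annotation fields the JSON contains. Below is a minimal Python sketch, assuming the usual GCV layout in which results sit in a `responses` list and each response may carry `textAnnotations`, `fullTextAnnotation`, and (for PDF/TIFF output) a `context.pageNumber`. The function name `summarize_gcv` is hypothetical. Whether gcv2hocr reads one field or the other is not confirmed in this thread; if a converter only reads a field that is absent, an almost empty hOCR would be one plausible result.

```python
import json


def summarize_gcv(doc):
    """Report which OCR fields each response in a GCV JSON document has."""
    # Assumption: a document without a "responses" wrapper is a single response.
    responses = doc["responses"] if "responses" in doc else [doc]
    summary = []
    for resp in responses:
        pages = resp.get("fullTextAnnotation", {}).get("pages", [])
        summary.append({
            "page_number": resp.get("context", {}).get("pageNumber"),
            "has_text_annotations": bool(resp.get("textAnnotations")),
            "has_full_text_annotation": "fullTextAnnotation" in resp,
            "full_text_pages": len(pages),
        })
    return summary


def summarize_gcv_file(path):
    """Load a GCV JSON file from disk and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize_gcv(json.load(fh))
```

For example, `summarize_gcv_file("output-1-to-1.json")` (the file name is a placeholder) returns one dictionary per response, which makes it easy to compare a PDF-derived file against an image-derived one.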

OmriPi mentioned this issue Feb 6, 2020

GCV to HOCR or PAGE conversion not working dinosauria123/gcv2hocr#33

Open

kba mentioned this issue Apr 29, 2020

Google Cloud Vision to PAGE-XML #125

Open
