GCV to HOCR or PAGE conversion not working #33

Open
OmriPi opened this issue Feb 6, 2020 · 2 comments

OmriPi commented Feb 6, 2020

Hi @dinosauria123!
This is the issue I posted on ocr-fileformat: UB-Mannheim/ocr-fileformat#121
As per your request I'm opening the issue here, copying the text:

I have the JSON output of Google Vision OCR for a PDF (emphasis on PDF, not an image).
I would like to create a searchable version of that PDF using the OCR results. I tried gcv2hocr, but it either doesn't work on PDF results or hits some other error, because the HOCR output I get from it is basically just the metadata. I tried ocr-fileformat on the same file and again got only the metadata. Converting to PAGE fails as well, and the output is a Java stack trace reporting exceptions. Does ocr-fileformat support GCV JSON generated from a PDF?

The file I'm trying to run it on is the sample file from google:
gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf

And the JSON is generated following this tutorial:
https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python

If you could assist me or point me toward a solution I would be very grateful, as I urgently need to solve this issue.

I have already used Google Vision to get the JSON. My problem is with the gcv-to-HOCR transformer found in this package: when I run it on the JSON I got from Google Vision, I get an almost blank output containing only the metadata.
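A quick way to narrow this down is to look at which annotation keys the JSON actually carries. The sketch below is only an inspection aid, under the unconfirmed assumption that the converters read the per-image `textAnnotations` list, while the PDF workflow writes one response per page with only `fullTextAnnotation`. The function name `summarize_gcv` and the sample dictionary are made up for illustration.

```python
import json

def summarize_gcv(doc):
    """Report which annotation keys each response in a GCV JSON document carries."""
    # Output files normally wrap results in "responses"; fall back to the
    # document itself if that wrapper is missing.
    responses = doc.get("responses", [doc])
    summary = []
    for index, resp in enumerate(responses):
        full = resp.get("fullTextAnnotation", {})
        summary.append({
            "index": index,
            "has_textAnnotations": "textAnnotations" in resp,
            "has_fullTextAnnotation": "fullTextAnnotation" in resp,
            "pages": len(full.get("pages", [])),
            # "context" is assumed to be present only in file (PDF/TIFF) output.
            "page_number": resp.get("context", {}).get("pageNumber"),
        })
    return summary

# Hand-written sample shaped like one page of PDF output (not real data).
sample = json.loads("""
{"responses": [{"fullTextAnnotation": {"pages": [{"width": 612, "height": 792}],
                                        "text": "hello"},
                "context": {"uri": "gs://bucket/file.pdf", "pageNumber": 1}}]}
""")

for row in summarize_gcv(sample):
    print(row)
```

If every row shows `has_textAnnotations` as `False`, a converter that only looks at `textAnnotations` would find nothing to emit, which would match an output containing only metadata.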

When I'm trying to convert it to PAGE instead I get this result:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:994)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:169)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:204)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
Exception in thread "main" java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:389)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:216)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

So I'm trying to understand why the gcv converters in this module are not working for me, even though I have a perfectly valid gcv JSON. I can send you the JSON generated by gcv so you can try converting it yourself, if that helps.
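For what it's worth, "Content is not allowed in prolog" at line 1, column 1 is what a Java XML parser reports when the input does not begin like XML. That would be consistent with the PAGE reader being handed the JSON file directly, though that is a guess rather than something confirmed here. A minimal sketch for checking what a file looks like before passing it to a converter (the helper name `sniff_format` is hypothetical):

```python
import json

def sniff_format(text):
    """Guess whether text is XML or JSON from its content, ignoring a BOM and leading whitespace."""
    stripped = text.lstrip("\ufeff \t\r\n")
    if stripped.startswith("<"):
        return "xml"
    try:
        json.loads(stripped)
        return "json"
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "unknown"

print(sniff_format('{"responses": []}'))
print(sniff_format('<?xml version="1.0"?><PcGts/>'))
```

If the file handed to the PAGE converter is reported as `json`, the parse error is expected, and the later NullPointerException may simply follow from no page having been read.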

Thanks in advance!

Owner

dinosauria123 commented Feb 6, 2020

Thank you for using gcv2hocr.

Your problem seems to be the conversion of PDF results to hOCR.
I had no plan to support PDF-to-hOCR conversion in gcv2hocr.
But this is the second time I have received this request, so I am starting to think about supporting it.
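One piece such PDF support would probably need is coordinate handling. As far as I know, GCV's file (PDF/TIFF) output gives bounding boxes as `normalizedVertices` in the 0 to 1 range, with each page's `width` and `height` alongside, whereas an hOCR `bbox` uses integer pixel coordinates. The sketch below shows that conversion only. It is not gcv2hocr's code, and the function names are hypothetical.

```python
def normalized_to_bbox(vertices, page_width, page_height):
    """Convert GCV normalizedVertices (0..1) to an (x0, y0, x1, y1) pixel box."""
    # Coordinates equal to zero may be omitted from the JSON, so default to 0.0.
    xs = [v.get("x", 0.0) * page_width for v in vertices]
    ys = [v.get("y", 0.0) * page_height for v in vertices]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))

def hocr_bbox_title(bbox):
    """Format a pixel box as the value of an hOCR title attribute."""
    return "bbox %d %d %d %d" % bbox

# Made-up word box on a made-up 1000 x 800 page.
word = [{"x": 0.25, "y": 0.5}, {"x": 0.75, "y": 0.5},
        {"x": 0.75, "y": 0.625}, {"x": 0.25, "y": 0.625}]
print(hocr_bbox_title(normalized_to_bbox(word, 1000, 800)))
```

Taking the minimum and maximum over all four vertices keeps the box axis-aligned even if the detected text is slightly rotated, at the cost of losing the rotation.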

If you want to make a PDF searchable, this script may help you.

https://github.com/mah-jp/pdf4search

Author

OmriPi commented Feb 10, 2020

Thank you @dinosauria123!
You're probably one of very few people in the world who are familiar enough with gcv output by now to make sense of it and make it possible! It would be amazing if you could add PDF support to gcv2hocr! I think that would make gcv2hocr even more useful as I have a hunch that more people need to OCR PDFs rather than image files... and hopefully the required change is not so big.
Please consider adding support, I can assist with anything if you decide to do it!

Thanks!
