&applyOcr=yes - no OCR taking place (skipping image pages) #50
Comments
Any progress on this issue? We have the same problem.
I ran it locally and stepped through the code. The OCR step seems to be returning an empty HTML body, and the Tika logs show the Tika server throwing an error when it attempts the OCR. It may be related:
WARN [qtp487764004-32] 21:51:48,577 org.eclipse.jetty.server.handler.ContextHandler Unimplemented getRequestCharacterEncoding() - use org.eclipse.jetty.servlet.ServletContextHandler
INFO [qtp487764004-32] 21:51:48,583 org.apache.tika.server.core.resource.RecursiveMetadataResource /rmeta (autodetecting type)
ERROR [qtp487764004-32] 21:51:48,637 org.apache.pdfbox.pdmodel.font.PDType1Font Can't read the embedded Type1 font AAAAAB+Helvetica
java.io.IOException: Start marker missing
at org.apache.fontbox.pfb.PfbParser.parsePfb(PfbParser.java:147) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.fontbox.pfb.PfbParser.<init>(PfbParser.java:125) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.fontbox.type1.Type1Font.createWithPFB(Type1Font.java:69) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1217) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:126) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:163) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:78) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadataToMetadataList(RecursiveMetadataResource.java:190) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.tika.server.core.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1384) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:178) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1306) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.Server.handle(Server.java:563) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
should draw image...
should draw image...
Attempted to resolve with an updated Tika .jar file. See build here:
I'm using &applyOcr=yes, but there's no indication that any OCR is taking place. I'm getting back the HTML from the PDF text OK, but pages that are images of (clear) text in my PDFs are completely skipped.
I'm using the latest docker image from the notebook.
thanks
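One way to confirm which pages are being skipped is to inspect the XHTML that Tika returns for the PDF. The sketch below is only a diagnostic aid, not part of this project: it assumes you have captured that XHTML as a string and that each PDF page is wrapped in a div with class "page", which is how Tika's PDF parser normally structures its output. The names PageTextCounter and empty_pages are made up for this example.

```python
from html.parser import HTMLParser


class PageTextCounter(HTMLParser):
    """Count visible characters inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.page_lengths = []  # one entry per page div, in document order
        self._depth = 0         # div nesting depth inside the current page; 0 means outside

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A nested div inside a page: track it so we know when the page closes.
            self._depth += 1
        elif "page" in (dict(attrs).get("class") or "").split():
            # A new page starts here.
            self.page_lengths.append(0)
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.page_lengths[-1] += len(data.strip())


def empty_pages(xhtml):
    """Return the 1-based numbers of pages that contain no text at all."""
    parser = PageTextCounter()
    parser.feed(xhtml)
    parser.close()
    return [i for i, n in enumerate(parser.page_lengths, start=1) if n == 0]


# Made-up sample: page 2 stands in for an image-only page that produced no text.
sample = (
    '<html><body>'
    '<div class="page"><p>Hello</p></div>'
    '<div class="page"><p /></div>'
    '<div class="page"><div class="ocr">scan text</div></div>'
    '</body></html>'
)
print(empty_pages(sample))
```

If the page numbers reported as empty match the image-only pages of the PDF, that would suggest the OCR output is missing before the HTML reaches any later processing step.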