&applyOcr=yes - no OCR taking place (skipping image pages) #50

Open
thelazydogsback opened this issue Apr 15, 2024 · 3 comments

@thelazydogsback

thelazydogsback commented Apr 15, 2024

I'm using &applyOcr=yes, but there's no indication that any OCR is taking place.
The HTML for the PDF's text layer comes back fine, but pages that are images of (clear) text are skipped entirely.
I'm using the latest Docker image from the notebook.
Thanks.
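
For anyone trying to reproduce this, here is a minimal sketch of how the applyOcr=yes flag can be attached to a request URL without disturbing the other query parameters. The host, port, path, and the renderFormat parameter in BASE_URL are assumptions for illustration and are not taken from this report; substitute whatever your deployment uses. The helper name with_ocr is hypothetical too.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_ocr(url: str, enabled: bool = True) -> str:
    """Return `url` with applyOcr=yes added (or removed), keeping other parameters."""
    parts = urlsplit(url)
    # Drop any existing applyOcr entry so the flag is never duplicated.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "applyOcr"
    ]
    if enabled:
        params.append(("applyOcr", "yes"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Assumed local endpoint, for illustration only.
BASE_URL = "http://localhost:5010/api/parseDocument?renderFormat=all"

if __name__ == "__main__":
    print(with_ocr(BASE_URL))
```

Comparing the response for the same PDF with and without the flag is a quick way to confirm whether the image-only pages are being skipped in both cases.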

@yagobski

Any progress on this issue? We have the same problem.

@jamesvillarrubia
Collaborator

I've run this locally and stepped through the code. The OCR step seems to be returning an empty HTML body, and the Tika logs show the Tika server throwing an error while attempting the OCR. It may be related:

WARN  [qtp487764004-32] 21:51:48,577 org.eclipse.jetty.server.handler.ContextHandler Unimplemented getRequestCharacterEncoding() - use org.eclipse.jetty.servlet.ServletContextHandler
INFO  [qtp487764004-32] 21:51:48,583 org.apache.tika.server.core.resource.RecursiveMetadataResource /rmeta (autodetecting type)
ERROR [qtp487764004-32] 21:51:48,637 org.apache.pdfbox.pdmodel.font.PDType1Font Can't read the embedded Type1 font AAAAAB+Helvetica
java.io.IOException: Start marker missing
	at org.apache.fontbox.pfb.PfbParser.parsePfb(PfbParser.java:147) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.fontbox.pfb.PfbParser.<init>(PfbParser.java:125) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.fontbox.type1.Type1Font.createWithPFB(Type1Font.java:69) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:363) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:155) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1217) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:238) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:126) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:167) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:163) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.server.core.resource.TikaResource.parse(TikaResource.java:352) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:78) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.server.core.resource.RecursiveMetadataResource.parseMetadataToMetadataList(RecursiveMetadataResource.java:190) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.tika.server.core.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:265) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1384) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:178) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1306) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.Server.handle(Server.java:563) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149) ~[tika-server-standard-nlm-modified-2.4.1_v6.jar:2.4.1]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
should draw image...
should draw image...
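
To make errors like the one above easier to spot in a long Tika log, here is a small sketch that pulls out the ERROR entries. It assumes every log header line follows the layout seen in this excerpt (level, thread in brackets, time, logger name, message); the regex and the find_errors helper are illustrative, not part of Tika or of this project.

```python
import re

# Matches header lines such as:
# ERROR [qtp487764004-32] 21:51:48,637 org.apache...PDType1Font Can't read ...
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<logger>\S+)\s+"
    r"(?P<message>.*)$"
)


def find_errors(log_text: str) -> list[tuple[str, str]]:
    """Return (logger, message) pairs for every ERROR header line in the log."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        # Stack-trace lines and plain stdout lines do not match and are skipped.
        if match and match.group("level") == "ERROR":
            errors.append((match.group("logger"), match.group("message")))
    return errors


if __name__ == "__main__":
    sample = (
        "INFO  [qtp487764004-32] 21:51:48,583 "
        "org.apache.tika.server.core.resource.RecursiveMetadataResource "
        "/rmeta (autodetecting type)\n"
        "ERROR [qtp487764004-32] 21:51:48,637 "
        "org.apache.pdfbox.pdmodel.font.PDType1Font "
        "Can't read the embedded Type1 font AAAAAB+Helvetica\n"
        "java.io.IOException: Start marker missing\n"
    )
    for logger, message in find_errors(sample):
        print(f"{logger}: {message}")
```

Note that this only lists the errors; whether the font error is what causes the image pages to be skipped still has to be confirmed separately.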

@jamesvillarrubia
Collaborator

I attempted to resolve this with an updated Tika .jar file. See the build here:

nlmatics/nlm-tika#5
