read_pdf fails on specific pdf locally, not through hosted api #63

Open
Ianpwest opened this issue Mar 26, 2024 · 5 comments

Comments

@Ianpwest

PDF in question:
JTR.pdf

This api call works great
    llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    pdf_url = "JTR.pdf"
    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf(pdf_url)

This local call fails
    llmsherpa_api_url = "[http://localhost:5010/api/parseDocument?renderFormat=all"](http://localhost:5010/api/parseDocument?renderFormat=all%22)
    pdf_url = "JTR.pdf"
    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf(pdf_url)

The local version is running from the latest docker build. Other pdfs work fine. Is there a way to get a better error message? Currently receiving: KeyError: 'return_dict'
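One way to get a more informative message than the bare KeyError is to inspect the raw server response before indexing into it. The sketch below is not part of llmsherpa: `extract_blocks` and `ParserResponseError` are hypothetical names, and the `return_dict` -> `result` -> `blocks` layout is an assumption inferred from the error message. The function takes the response body as text, so it can be fed from whatever HTTP client is used to post the PDF.

```python
import json


class ParserResponseError(RuntimeError):
    """Raised when the parser response lacks the expected structure (hypothetical helper)."""


def extract_blocks(response_text: str) -> list:
    """Return the parsed blocks, or raise an error that shows what the server sent."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError as exc:
        # The server answered with something other than JSON (e.g. an HTML error page).
        raise ParserResponseError(
            f"Response is not JSON; first 200 chars: {response_text[:200]!r}"
        ) from exc
    try:
        # Assumed layout: return_dict -> result -> blocks.
        return payload["return_dict"]["result"]["blocks"]
    except (KeyError, TypeError) as exc:
        # Surface the payload itself so the server-side failure reason is visible.
        raise ParserResponseError(
            f"Unexpected response shape (missing {exc}); server sent: {str(payload)[:200]}"
        ) from exc
```

With a check like this, a failed parse reports whatever status or reason the server returned instead of only the name of the missing key.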

I noticed there are other issues open around this error but did not find any matching this case where it works on one and not the other.

I appreciate your time and any insight. Thanks!

@wolfassi123

Hey, @Ianpwest did you manage to solve this?

@Ianpwest
Author

> Hey, @Ianpwest did you manage to solve this?

@wolfassi123 No. There were also some other parsing issues with different character sets. The library is promising but seemingly under-supported; there has been no movement on my tickets.

@kiran-nlmatics
Collaborator

Hello @Ianpwest, @wolfassi123, I have fixed the issue, and it seems to be working with the sample PDF provided here. Can you pull from the main branch of nlm-ingestor and verify?

@dalmia

dalmia commented Jul 16, 2024

Switching to the docker image for nlm-ingestor in this comment worked for me.

@rmast

rmast commented Jul 19, 2024

llmsherpa_api_url = "[http://localhost:5010/api/parseDocument?renderFormat=all"]

If you look closely, you'll see that the " and the [ are swapped in the local call.

The error handling of a nonexistent renderFormat could be better.
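A malformed URL like the one above can be caught on the client side before any request is sent. The following is a minimal sketch using only the standard library; `check_api_url` is a hypothetical helper, not an llmsherpa function, and its rules (rejecting brackets, parentheses and quotes, and requiring a `renderFormat` parameter) are deliberately strict assumptions rather than anything the API documents.

```python
from urllib.parse import urlsplit, parse_qs


def check_api_url(url: str) -> str:
    """Validate a parser API URL and return its renderFormat value (hypothetical helper)."""
    # Leftover Markdown link syntax or quotes usually mean a copy-paste accident.
    stray = [ch for ch in '[]()"' if ch in url]
    if stray:
        raise ValueError(f"URL contains stray characters {stray}: {url!r}")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"URL must start with http:// or https:// and name a host: {url!r}")
    values = parse_qs(parts.query).get("renderFormat")
    if not values:
        raise ValueError(f"URL has no renderFormat query parameter: {url!r}")
    return values[0]
```

Calling `check_api_url` on the URL string before constructing the reader would turn the pasted link markup into an immediate, readable ValueError rather than a failure deep inside response parsing.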
