UnicodeEncodeError - Several Different PDFs #598

Open
nickrallison opened this issue Dec 14, 2024 · 1 comment
Labels: bug, PDF parsing

@nickrallison

Bug

Docling parses some PDFs successfully but then fails to write the results to the Markdown file:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2217' in position 51: character maps to <undefined>
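
For context, the 'charmap' error means the text was encoded with a legacy single-byte code page that has no slot for U+2217 (ASTERISK OPERATOR). The minimal sketch below reproduces the same failure outside Docling. It assumes cp1252 as the code page, which is a common Windows default but is not confirmed by the report; `can_encode` is a hypothetical helper used only for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+2217 (ASTERISK OPERATOR) often appears in math-heavy PDFs.
sample = "a \u2217 b"

# Assumption: cp1252 stands in for the system default encoding on the
# affected machine. It has no mapping for U+2217, so encoding fails.
cp1252_ok = can_encode(sample, "cp1252")

# UTF-8 can represent every Unicode code point, so encoding succeeds.
utf8_ok = can_encode(sample, "utf-8")
utf8_bytes = sample.encode("utf-8")

print(cp1252_ok, utf8_ok, utf8_bytes)
```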

I was able to resolve this for this specific PDF by changing line 1941 of docling_core\types\doc\document.py, but the tests then failed.
[Image attached in the original issue]
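
The exact change made at line 1941 is not available in this text. A common remedy for this class of error is to pass an explicit encoding when opening the output file, so the write no longer depends on the system default. The sketch below illustrates that idea only; `save_markdown` is a hypothetical helper, not the docling-core function, and it is an assumption that the change in the report looked like this.

```python
import tempfile
from pathlib import Path


def save_markdown(markdown: str, filename: Path) -> None:
    """Write Markdown text to disk as UTF-8 (hypothetical helper).

    Without encoding="utf-8", open() uses the locale's preferred encoding,
    which on some Windows setups is a legacy code page such as cp1252.
    """
    with open(filename, "w", encoding="utf-8") as fw:
        fw.write(markdown)


# Write a string containing U+2217 to a temporary file and read it back.
out_path = Path(tempfile.mkdtemp()) / "out.md"
save_markdown("x \u2217 y", out_path)
round_trip = out_path.read_text(encoding="utf-8")
```

Files written this way must also be read back with `encoding="utf-8"`, otherwise the same mismatch can reappear on the reading side.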

Steps to reproduce

  1. Download this PDF: https://typeset.io/pdf/computational-challenges-in-bounded-model-checking-44b7toabj9.pdf
  2. Run: docling computational-challenges-in-bounded-model-checking-44b7toabj9.pdf
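
As a possible workaround while reproducing, the same command can be launched with Python's UTF-8 Mode enabled through the `PYTHONUTF8=1` environment variable, which makes `open()` default to UTF-8. This is a sketch under the assumption that the failing write relies on the default encoding; it has not been verified against this Docling version. `utf8_env` and `run_docling` are hypothetical helper names.

```python
import os
import subprocess
from typing import Mapping, Optional


def utf8_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with Python's UTF-8 Mode enabled."""
    env = dict(os.environ if base is None else base)
    env["PYTHONUTF8"] = "1"  # Python 3.7+: text I/O defaults to UTF-8
    return env


def run_docling(pdf_path: str) -> subprocess.CompletedProcess:
    """Run the docling CLI on one PDF with UTF-8 Mode enabled.

    Assumes the `docling` command is on PATH, as in step 2 above.
    """
    return subprocess.run(["docling", pdf_path], env=utf8_env(), check=True)
```

For example, `run_docling("computational-challenges-in-bounded-model-checking-44b7toabj9.pdf")` runs step 2 with the modified environment.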

I've encountered this on other PDFs as well:
https://batch.libretexts.org/print/url=https://math.libretexts.org/Bookshelves/Combinatorics_and_Discrete_Mathematics/Elementary_Foundations%3A_An_Introduction_to_Topics_in_Discrete_Mathematics_(Sylvestre)/03%3A_Boolean_algebra/3.02%3A_Disjunctive_Normal_Form.pdf

Docling version

Docling version: 2.12.0
Docling Core version: 2.9.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.11.9

@maxmnemonic
Contributor

@nickrallison thanks for pointing this out!
I'll test it further, and if it doesn't break anything (which I don't think it should), we'll regenerate the tests.
