Special characters are no longer recognized correctly #600

Open
JeandeBalzac opened this issue Dec 15, 2024 · 0 comments
Labels
bug (Something isn't working), PDF parsing


JeandeBalzac commented Dec 15, 2024

Bug

This bug did not occur up to and including the following versions:
docling 2.9.0
docling-core 2.10.0
docling-ibm-models 2.0.8
docling-parse 2.1.2

I demonstrate the bug with two different parts of a PDF.

Here is the first example.
The header is not rendered correctly below, but it is correct in the actual output, so it can be ignored.
The glyph placeholders, however, are a serious problem.

| | | Shape | Appearance | Appearance | Classification Accuracy (%) | Classification Accuracy (%) | Classification Accuracy (%) | Classification Accuracy (%) | Classification Accuracy (%) |
| | | . | layout type. | using ground truth. | family.(S. 4.1) | breed (S. 4.2) | breed (S. 4.2) | both (S. 4.3) | both (S. 4.3) |

. cat dog hierarchical flat
0 1 glyph[check] - - 94.21 NA NA NA NA
1 2 - Image - 82.56 52.01 40.59 NA 39.64
2 3 - Image + Head - 85.06 60.37 52.10 NA 51.23
3 4 - Image + Head + Body - 87.78 64.27 54.31 NA 54.05
4 5 - Image + Head + Body glyph[check] 88.68 66.12 57.29 NA 56.60
5 6 glyph[check] Image - 94.88 50.27 42.94 42.29 43.30
6 7 glyph[check] Image + Head - 95.07 59.11 54.56 52.78 54.03
7 8 glyph[check] Image + Head + Body - 94.89 63.48 55.68 55.26 56.68
8 9 glyph[check] Image + Head + Body glyph[check] 95.37 66.07 59.18 57.77 59.21

The original looks like this:
Screenshot from 2024-12-15 09-15-35
The original table contains special characters such as the check mark. They were recognized correctly in the version without GPU.
Here is the second example:

Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

W * glyph[circledot] W *4 GLYPH<16> V GLYPH<133> 2 GLYPH<240> 4 GLYPH<239> V * ··· 5 glyph[floorleft] ⁄GLYPH<134> GLYPH<239> glyph[circledot] GLYPH<16> glyph[circledot] GLYPH<134> glyph[turnstileright] glyph[circledot] ⁄GLYPH<134> · GLYPH<16> V 4 GLYPH<239> 4 glyph[turnstileright] 4 -d 5GLYPH<134> V glyph[circledot] dd4GLYPH<23> glyph[circledot] GLYPH<134> glyph[turnstileright] glyph[circledot] GLYPH<226> ··· 52 21)
IBM Research Saumerstrasse 4 8803 Ruschlikon, Switzerland
The original looks like this:
The glyph placeholders above come from the topmost line.
There is also a second bug: the characters ä and ü are not recognized correctly either. This was already the case in the old versions.
image

Steps to reproduce

The PDF for the second example is attached:
article.pdf
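
To make the regression easier to check across versions, here is a minimal sketch that counts the unresolved glyph placeholders in already-extracted text. It assumes the placeholders always take one of the two forms visible in the output quoted above (`glyph[name]` and `GLYPH<number>`). The helper name `find_unresolved_glyphs` is hypothetical and not part of docling.

```python
import re
from collections import Counter

# Matches the two placeholder forms seen in the output quoted in this issue:
#   glyph[check], glyph[circledot]   (named glyphs)
#   GLYPH<16>, GLYPH<239>            (numeric glyph codes)
# Assumption: no other placeholder spellings occur.
PLACEHOLDER = re.compile(r"glyph\[[A-Za-z0-9_]+\]|GLYPH<\d+>")


def find_unresolved_glyphs(text: str) -> Counter:
    """Count each unresolved glyph placeholder found in `text`."""
    counts = Counter()
    for match in PLACEHOLDER.finditer(text):
        counts[match.group(0)] += 1
    return counts


if __name__ == "__main__":
    # One row copied from the first example in this issue.
    sample = (
        "8 9 glyph[check] Image + Head + Body glyph[check] "
        "95.37 66.07 59.18 57.77 59.21"
    )
    for placeholder, count in find_unresolved_glyphs(sample).items():
        print(f"{placeholder}: {count}")
```

Running this on the text exported from article.pdf under both sets of versions listed in this issue should show an empty result for the older set and a non-empty result for the newer one, if the regression is as described.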

Docling version

docling 2.12.0
docling-core 2.10.0
docling-ibm-models 3.1.0
docling-parse 3.0.0

Python version

python 3.10
