
Error parsing Table #712

Open
pbonito opened this issue Jan 8, 2025 · 7 comments
Labels: bug, PDF parsing

Comments


pbonito commented Jan 8, 2025

Bug

Docling fails to parse a table in a PDF.
The table structure seems to be parsed correctly, but the cells contain glyph names instead of digits,
e.g. 50,945 -> /five.tf/zero.tf,/nine.tf/four.tf/five.tf
The table was parsed correctly with version 2.8.
In both versions, some text is missing at the end of the pages.
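Until the parser itself is fixed, the mapping in the example above can be undone in post-processing. The following is only a minimal sketch, not part of Docling: it assumes the glyph names always take the form `/<digit name>.tf` or `/uniXXXX` (the two patterns visible in this report), and the helper names `decode_glyph_names` and `has_glyph_names` are made up for illustration.

```python
import re

# Digit glyph stems seen in this report. The ".tf" suffix is assumed to be
# part of every digit glyph name; other fonts may use different names.
DIGIT_NAMES = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Matches "/five.tf" style digit glyphs or "/uni00B9" style code-point glyphs.
GLYPH_RE = re.compile(
    r"/(?:(zero|one|two|three|four|five|six|seven|eight|nine)\.tf"
    r"|uni([0-9A-Fa-f]{4}))"
)


def decode_glyph_names(text: str) -> str:
    """Replace raw glyph names with the characters they stand for."""
    def _substitute(match: re.Match) -> str:
        if match.group(1):  # digit glyph such as "/five.tf"
            return DIGIT_NAMES[match.group(1)]
        return chr(int(match.group(2), 16))  # "/uniXXXX" code point

    return GLYPH_RE.sub(_substitute, text)


def has_glyph_names(text: str) -> bool:
    """Return True if the text still contains undecoded glyph names."""
    return GLYPH_RE.search(text) is not None


if __name__ == "__main__":
    print(decode_glyph_names("/five.tf/zero.tf,/nine.tf/four.tf/five.tf"))  # → 50,945
```

Ordinary slashes such as the one in "Profit/loss" are left alone because they are not followed by a known glyph name. `has_glyph_names` could also serve as a quick regression check on converter output.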
...

Steps to reproduce

Parse the following document:
example.pdf

Results:

Other associated companies

In March 2021, Daimler Financial Services Investment

Due to the business development of BAIC Motor Corporation Ltd. (BAIC Motor) , the Group recognized an impairment of € 120 million on the carrying amount of its investment in BAIC Motor in the fourth quarter of 2021. The expenses were included in the line item gains/losses on equity-method investments. The investment is reported in the reconciliation of the reportable segments of the Group.

Company LLC sold all its shares in Via Transportation Inc. , United States to external shareholders. The sale resulted in income before taxes of € 89 million, which was reported in the line item gains/losses on equitymethod investments, net. The company had been allocated to the Mercedes-Benz Mobility segment.

Table D.43 shows summarized aggregated financial information according to IFRS for the significant associated companies accounted for using the equity method after purchase price allocation, which was the basis for equity-method accounting in the Group's Consolidated Financial Statements.

Summarized IFRS financial information on significant associated companies accounted for using the equity method

Daimler Truck/uni00B9 Daimler Truck/uni00B9 Daimler Truck/uni00B9 BBAC 2
/two.tf/zero.tf/two.tf/two.tf /two.tf/zero.tf/two.tf/one.tf /two.tf/zero.tf/two.tf/two.tf /two.tf/zero.tf/two.tf/one.tf
In millions of euros
Information on the statement of income
Revenue /five.tf/zero.tf,/nine.tf/four.tf/five.tf /two.tf/eight.tf,/four.tf/one.tf/eight.tf /two.tf/four.tf,/eight.tf/two.tf/zero.tf /two.tf/one.tf,/two.tf/eight.tf/eight.tf
Profit/loss after taxes /two.tf,/seven.tf/six.tf/three.tf /two.tf,/two.tf/six.tf/five.tf /three.tf,/six.tf/four.tf/nine.tf /three.tf,/two.tf/zero.tf/five.tf
Other comprehensive income/loss /one.tf,/three.tf/two.tf/zero.tf /one.tf,/one.tf/nine.tf/six.tf /five.tf/two.tf -/three.tf/four.tf
Total comprehensive income/loss /four.tf,/zero.tf/eight.tf/three.tf /three.tf,/four.tf/six.tf/one.tf /three.tf,/seven.tf/zero.tf/one.tf /three.tf,/one.tf/seven.tf/one.tf
Information on the statement of financial position and reconciliation to the equity-method carrying amounts Information on the statement of financial position and reconciliation to the equity-method carrying amounts
Non-current assets /three.tf/eight.tf,/nine.tf/five.tf/seven.tf /three.tf/three.tf,/five.tf/six.tf/one.tf /seven.tf,/one.tf/zero.tf/one.tf /seven.tf,/one.tf/seven.tf/nine.tf
Current assets /three.tf/two.tf,/three.tf/seven.tf/one.tf /two.tf/eight.tf,/three.tf/seven.tf/zero.tf /nine.tf,/three.tf/six.tf/one.tf /eight.tf,/one.tf/nine.tf/seven.tf
Non-current liabilities /two.tf/two.tf,/four.tf/five.tf/one.tf /one.tf/seven.tf,/nine.tf/six.tf/two.tf /one.tf,/one.tf/two.tf/two.tf /one.tf,/one.tf/one.tf/two.tf
Current liabilities /two.tf/one.tf,/one.tf/five.tf/zero.tf /one.tf/eight.tf,/eight.tf/one.tf/six.tf /eight.tf,/five.tf/nine.tf/two.tf /eight.tf,/one.tf/one.tf/six.tf
Equity (including non-controlling interests) /two.tf/seven.tf,/seven.tf/two.tf/seven.tf /two.tf/five.tf,/one.tf/five.tf/three.tf /six.tf,/seven.tf/four.tf/eight.tf /six.tf,/one.tf/four.tf/eight.tf
Equity (excluding non-controlling interests) attributable to the Group /eight.tf,/one.tf/zero.tf/seven.tf /eight.tf,/five.tf/seven.tf/nine.tf /three.tf,/three.tf/zero.tf/six.tf /three.tf,/zero.tf/one.tf/three.tf
Unrealized profit (-)/loss (+) on sales to/purchases from - - -/three.tf/three.tf/two.tf -/two.tf/five.tf/eight.tf
Other reconciliation effects including equity-method goodwill and impairments on the investment /nine.tf/two.tf /one.tf/eight.tf/three.tf -/one.tf -/two.tf
Equity-method carrying amount /eight.tf,/one.tf/nine.tf/nine.tf /eight.tf,/seven.tf/six.tf/two.tf /two.tf,/nine.tf/seven.tf/three.tf /two.tf,/seven.tf/five.tf/three.tf

2 BBAC:

Figures for the statement of income relate to the period of 1 January to 31 December.

Figures for the statement of financial position and the reconciliation to the equity-method carrying amounts relate to the balance sheet date of 31 December and include investor level

adjustments.

...

Docling version

...
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0
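For anyone adding their own environment to this thread, the version list above can also be gathered with the standard library. This is a small sketch; the distribution names (`docling`, `docling-core`, `docling-ibm-models`, `docling-parse`) are assumed to match the packages as installed from PyPI, and `collect_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed to correspond to the four versions listed above.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def collect_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```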

Python version

...
Python 3.11.5

pbonito added the bug label Jan 8, 2025

samcps commented Jan 9, 2025

This is a major defect.

Any indication of the root cause?


hadjebi commented Jan 9, 2025

We also faced the same issue.

Contributor

cau-git commented Jan 9, 2025

This is an issue in docling-parse v2, which is not handling some fonts correctly. We will track this in https://github.com/DS4SD/docling-parse/issues.

As a workaround, you can try your luck with docling-parse-v1 or pypdfium backends.
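Switching the backend as suggested above might look roughly like this. It is a sketch under assumptions rather than an official recipe: the module paths in `BACKENDS` and the `PdfFormatOption(backend=...)` argument reflect the Docling 2.x layout as I understand it and should be checked against the installed version, and `build_pdf_converter` is a hypothetical helper name. Docling is imported lazily, so it only needs to be installed when the function is actually called with a known backend.

```python
import importlib

# Short names (made up for this sketch) mapped to (module path, class name).
# The module paths are assumptions about the Docling 2.x package layout.
BACKENDS = {
    "docling-parse-v1": (
        "docling.backend.docling_parse_backend",
        "DoclingParseDocumentBackend",
    ),
    "pypdfium": (
        "docling.backend.pypdfium2_backend",
        "PyPdfiumDocumentBackend",
    ),
}


def build_pdf_converter(backend_name: str):
    """Return a DocumentConverter that parses PDFs with the chosen backend."""
    if backend_name not in BACKENDS:
        raise ValueError(f"unknown backend: {backend_name!r}")

    module_name, class_name = BACKENDS[backend_name]
    backend_cls = getattr(importlib.import_module(module_name), class_name)

    # Imported here so the module loads even when docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(backend=backend_cls)}
    )


if __name__ == "__main__":
    converter = build_pdf_converter("pypdfium")
    result = converter.convert("example.pdf")
    print(result.document.export_to_markdown())
```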

Author

pbonito commented Jan 10, 2025

Thanks @cau-git.
I get the same error with DoclingParseDocumentBackend and PyPdfiumDocumentBackend.

@PeterStaar-IBM
Contributor

@pbonito @hadjebi @samcps @cau-git This is now fixed here: DS4SD/docling-parse#82. I will review a few more problematic documents before merging, but that should happen in the next few days.

Author

pbonito commented Jan 16, 2025

@PeterStaar-IBM any progress on this? We are stuck on this issue.

@PeterStaar-IBM
Contributor

We will publish a new release soon (this week).
