Table detection from PDF document not accurate #3804

Open
kmrspace opened this issue Dec 2, 2024 · 0 comments
Labels
bug Something isn't working pdf

Describe the bug
I used a financial statement PDF to extract its elements, but the tables are not identified properly. Sometimes the last few rows are missing, and sometimes a few columns are missing altogether. The first few rows, which form the table header, are also missing from the chunk, which leaves the RAG system unable to find the right answer.

To Reproduce

Here is the Python code used:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename=filename,  # path to the PDF linked below
    # Unstructured Helpers
    chunking_strategy="by_title",
    strategy="hi_res",
    infer_table_structure=True,
    max_partition=3000,
    max_characters=3000,
    # overlap_all=True
)

I used this file: https://investors.intuit.com/_assets/_a21b3f6dd5cf08cb659458f26330acce/intuit/news/2024-08-22_Intuit_Reports_Strong_Fourth_Quarter_and_Full_1202.pdf

Look at some of the complex tables in this file and compare these outputs:

str(elements[13].metadata.text_as_html)
'

July 31, 2024July 31, 2023July 31, 2024July 31, 2023
st revenue:
Service$2,670$2,340$13,861$12,317
Product and other5143722,4242,051
Total net revenue3,184—~=*«‘T1216,285«14,368.
sts and expenses:
Cost of revenue:
Cost of service revenue7336563,2502,908
Cost of product and other revenue14166972
Amortization of acquired technology3641146163
Selling and marketing1,1048404,3123,762
Research and development7256802,7542,539
General and administrative3773411,4181,300
Amortization of other acquired
intangible assets123121483483
Restructuring223223
Total costs and expenses [A]3,3352,69512,65511,227
Operating income (loss)(151)173,6303,141
erest expense(60)(68)(242)(248)
erest and other income, net714616296
some (loss) before income taxes(140)(5)3,5502,989
come tax provision (benefit) [B](120)(94)587605
st income (loss)$(20)$89$2,963$2,384
isic net income (loss) per share$(0.07)$0.32$10.58$8.49
lares used in basic per share Iculations280280280281
'

str(elements[13].text)

'Three Months Ended Twelve Months Ended July 31, July 31, July 31, July 31, (In millions) 2024 2023 2024 2023 Cost of revenue $ 102 $ 83 $ 402 $ 374 Selling and marketing 137 119 506 Research and development 161 148 639 General and administrative 94 98 368 Restructuring 25 — 25 Total share-based compensation expense $ 519 $ 448 $ 1,940 $'

  1. The table content (mainly in the text form) misses some rows or cells and does not match the actual table.
  2. The top row, 'Three Months Ended', is missing entirely from the HTML table element.
  3. The first few characters of each row in the table are missing.
  4. The text content does not match the HTML content at all; the text version is missing one column completely.
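One way to quantify the mismatch described in point 4 is to list the HTML table cells that never appear in the plain-text version of the same element. The following is only a rough sketch using the Python standard library; the helper names `CellCollector`, `html_rows`, and `cells_missing_from_text` are hypothetical and not part of unstructured, and the substring check is deliberately crude. In practice the two inputs would be `element.metadata.text_as_html` and `element.text`.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_rows(table_html):
    """Return the table as a list of rows, each a list of stripped cell strings."""
    parser = CellCollector()
    parser.feed(table_html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


def cells_missing_from_text(table_html, text):
    """Return non-empty HTML cells whose content does not occur in the text."""
    normalized = " ".join(text.split())
    return [
        cell
        for row in html_rows(table_html)
        for cell in row
        if cell and " ".join(cell.split()) not in normalized
    ]


# Made-up sample data, not taken from the PDF in this report.
sample_html = (
    "<table><tr><th>Item</th><th>2024</th><th>2023</th></tr>"
    "<tr><td>Service</td><td>$2,670</td><td>$2,340</td></tr></table>"
)
sample_text = "Item 2024 Service $2,670"
print(cells_missing_from_text(sample_html, sample_text))
```

A non-empty result for a chunk means the text and HTML versions disagree, which gives a quick filter for finding the affected elements before deciding which version to embed.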

Expected behavior

  1. There should be no difference between the text and HTML versions of chunks containing table elements. It is not clear which is more reliable for embedding.
  2. The table header is missing from the table element. Also, the description of the table falls into the previous element, so the table loses its context.
  3. In some cases a table is split between chunks, so the partial table loses the context of the parent table.
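To illustrate the kind of context being asked for in points 2 and 3, here is a minimal sketch of a post-processing step that pairs each table with the most recent title and the narrative text immediately before it. It assumes partitioning is run without chunking so that tables arrive as separate elements. The `Element` dataclass is a simplified, hypothetical stand-in for unstructured's element objects (which expose `text_as_html` under `metadata`), and `tables_with_context` is not a library function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Simplified stand-in for a partitioned element (assumption, not the real class)."""
    category: str                       # e.g. "Title", "NarrativeText", "Table"
    text: str
    text_as_html: Optional[str] = None  # flattened here for brevity


def tables_with_context(elements):
    """Return one dict per table, carrying its nearest title and lead-in text."""
    chunks = []
    last_title = ""
    previous = None
    for element in elements:
        if element.category == "Title":
            last_title = element.text
        elif element.category == "Table":
            # Use the element right before the table only if it is narrative text.
            lead_in = ""
            if previous is not None and previous.category == "NarrativeText":
                lead_in = previous.text
            chunks.append(
                {
                    "title": last_title,
                    "lead_in": lead_in,
                    "table": element.text_as_html or element.text,
                }
            )
        previous = element
    return chunks


# Made-up elements for illustration only.
demo = [
    Element("Title", "Statements of Operations"),
    Element("NarrativeText", "(In millions, except per share amounts)"),
    Element("Table", "Service 2,670 2,340", "<table><tr><td>Service</td></tr></table>"),
]
for chunk in tables_with_context(demo):
    print(chunk["title"], "|", chunk["lead_in"])
```

Each resulting dict can be embedded as a single chunk, so a table keeps its heading and description even when the chunker would otherwise place them in the preceding element.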


Environment Info
PyTorch version: 2.5.1+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Quadro T1000 with Max-Q Design
Nvidia driver version: 538.18
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2310
DeviceID=CPU0
Family=198
L2CacheSize=1536
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2712
Name=Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.1
[pip3] onnxruntime==1.19.2
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[conda] Could not collect


@kmrspace kmrspace added the bug Something isn't working label Dec 2, 2024
@scanny scanny added the pdf label Dec 15, 2024