Table detection from PDF document not accurate #3804

kmrspace · 2024-12-02T12:51:24Z

Describe the bug
I used a financial statement PDF to extract its elements, but the tables are not identified properly. Sometimes the last few rows are missing, and sometimes entire columns are missing. The first few rows, which represent the table header, are also missing from the chunk, which makes the RAG system unable to find the right answer.

To Reproduce

Here is the Python code used:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename=filename,  # path to the PDF linked below
    # Unstructured Helpers
    chunking_strategy="by_title",
    strategy="hi_res",
    infer_table_structure=True,
    max_partition=3000,
    max_characters=3000,
    # overlap_all=True
)

I used this file: https://investors.intuit.com/_assets/_a21b3f6dd5cf08cb659458f26330acce/intuit/news/2024-08-22_Intuit_Reports_Strong_Fourth_Quarter_and_Full_1202.pdf

Look at some of the complex tables in it and compare these outputs:

str(elements[13].metadata.text_as_html)
'

	July 31, 2024	July 31, 2023	July 31, 2024	July 31, 2023
st revenue:
Service	$	2,670	$	2,340	$	13,861	$	12,317
Product and other		514		372		2,424		2,051
Total net revenue		3,184		—~=*«‘T12		16,285		«14,368.
sts and expenses:
Cost of revenue:
Cost of service revenue		733		656		3,250		2,908
Cost of product and other revenue		14		16		69		72
Amortization of acquired technology		36		41		146		163
Selling and marketing		1,104		840		4,312		3,762
Research and development		725		680		2,754		2,539
General and administrative		377		341		1,418		1,300
Amortization of other acquired
intangible assets		123		121		483		483
Restructuring		223		—		223		—
Total costs and expenses [A]		3,335		2,695		12,655		11,227
Operating income (loss)		(151)		17		3,630		3,141
erest expense		(60)		(68)		(242)		(248)
erest and other income, net		71	46		162		96
some (loss) before income taxes		(140)		(5)		3,550		2,989
come tax provision (benefit) [B]		(120)		(94)		587		605
st income (loss)	$	(20)	$	89	$	2,963	$	2,384
isic net income (loss) per share	$	(0.07)	$	0.32	$	10.58	$	8.49
lares used in basic per share Iculations		280		280		280		281

'

str(elements[13].text)

'Three Months Ended Twelve Months Ended July 31, July 31, July 31, July 31, (In millions) 2024 2023 2024 2023 Cost of revenue $ 102 $ 83 $ 402 $ 374 Selling and marketing 137 119 506 Research and development 161 148 639 General and administrative 94 98 368 Restructuring 25 — 25 Total share-based compensation expense $ 519 $ 448 $ 1,940 $'

The table content (mainly in the text form) is missing some rows or cells and does not match the actual table.
The top row 'Three months ended' is completely missing from the HTML table element.
The first few characters of each row in the table are missing.
The text content does not match the HTML content at all; the text version is missing one column completely.
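
To make the mismatch between the two views easier to quantify, here is a minimal sketch that uses only the Python standard library. It parses the cells out of an HTML table string and lists the tokens that appear in only one of the two views. The names `CellCollector`, `html_table_rows`, and `compare_table_views` are hypothetical helpers written for this report, not part of unstructured.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate cells that appear outside any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_rows(html):
    """Return the table as a list of rows of stripped cell strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


def compare_table_views(html, text):
    """Return the whitespace-separated tokens found in only one view.

    Tokens are compared as sets, so repeated values are counted once.
    """
    html_tokens = {
        tok for row in html_table_rows(html) for cell in row for tok in cell.split()
    }
    text_tokens = set(text.split())
    return {
        "only_in_html": sorted(html_tokens - text_tokens),
        "only_in_text": sorted(text_tokens - html_tokens),
    }
```

For the chunk above, the call would be `compare_table_views(elements[13].metadata.text_as_html, elements[13].text)`. Two non-empty lists in the result would indicate that the two views disagree.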

Expected behavior

There should be no difference between the text and HTML versions of chunks that contain table elements. It is not clear which is more reliable for embedding.
The table header is missing from the table element. Also, the description of the table falls into the previous element, so the table loses its context.
In some cases the table is also split between chunks, so the partial table loses the context of the parent table.
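
As a possible workaround for the lost context (a sketch only, not a fix for the detection problem), each table could be paired with the nearest preceding title or narrative text before embedding. The sketch assumes the elements come from a partition call made without `chunking_strategy`, so that titles, narrative text, and tables are still separate elements, and that each element exposes `category`, `text`, and `metadata.text_as_html`. The helper name `attach_table_context` and the demo objects are hypothetical stand-ins.

```python
from types import SimpleNamespace


def attach_table_context(elements, context_categories=("Title", "NarrativeText")):
    """Pair each table with the nearest preceding title or narrative text."""
    results = []
    last_context = None
    for el in elements:
        category = getattr(el, "category", None)
        if category in context_categories:
            # Remember the most recent descriptive text seen so far.
            last_context = el.text
        elif category == "Table":
            metadata = getattr(el, "metadata", None)
            results.append(
                {
                    "context": last_context,
                    "table_text": el.text,
                    "table_html": getattr(metadata, "text_as_html", None),
                }
            )
    return results


# Stand-in objects that mimic the assumed element attributes (not real output).
demo = [
    SimpleNamespace(category="Title", text="Share-based compensation", metadata=None),
    SimpleNamespace(
        category="Table",
        text="Cost of revenue $ 102 $ 83",
        metadata=SimpleNamespace(text_as_html="<table></table>"),
    ),
]
pairs = attach_table_context(demo)
```

Each entry in `pairs` keeps the description next to the table, so both can be embedded together regardless of where the chunk boundary falls.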

Environment Info
PyTorch version: 2.5.1+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Quadro T1000 with Max-Q Design
Nvidia driver version: 538.18
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2310
DeviceID=CPU0
Family=198
L2CacheSize=1536
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2712
Name=Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.1
[pip3] onnxruntime==1.19.2
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[conda] Could not collect

kmrspace added the bug label Dec 2, 2024

scanny added the pdf label Dec 15, 2024
