Describe the bug
I extracted the elements of a financial statement, but the tables are not identified properly. Sometimes the last few rows are missing, and sometimes a few columns are missing altogether. The first few rows, which form the table header, are also missing from the chunk, which leaves the RAG system unable to find the right answer.
Look at some of the complex tables in this file and compare the following outputs:
str(elements[13].metadata.text_as_html)
'
| | July 31, 2024 | July 31, 2023 | July 31, 2024 | July 31, 2023 |
| --- | --- | --- | --- | --- |
| st revenue: | | | | |
| Service | $ 2,670 | $ 2,340 | $ 13,861 | $ 12,317 |
| Product and other | 514 | 372 | 2,424 | 2,051 |
| Total net revenue | 3,184 | —~=*«‘T12 | 16,285 | «14,368. |
| sts and expenses: | | | | |
| Cost of revenue: | | | | |
| Cost of service revenue | 733 | 656 | 3,250 | 2,908 |
| Cost of product and other revenue | 14 | 16 | 69 | 72 |
| Amortization of acquired technology | 36 | 41 | 146 | 163 |
| Selling and marketing | 1,104 | 840 | 4,312 | 3,762 |
| Research and development | 725 | 680 | 2,754 | 2,539 |
| General and administrative | 377 | 341 | 1,418 | 1,300 |
| Amortization of other acquired intangible assets | 123 | 121 | 483 | 483 |
| Restructuring | 223 | — | 223 | — |
| Total costs and expenses [A] | 3,335 | 2,695 | 12,655 | 11,227 |
| Operating income (loss) | (151) | 17 | 3,630 | 3,141 |
| erest expense | (60) | (68) | (242) | (248) |
| erest and other income, net | 71 | 46 | 162 | 96 |
| some (loss) before income taxes | (140) | (5) | 3,550 | 2,989 |
| come tax provision (benefit) [B] | (120) | (94) | 587 | 605 |
| st income (loss) | $ (20) | $ 89 | $ 2,963 | $ 2,384 |
| isic net income (loss) per share | $ (0.07) | $ 0.32 | $ 10.58 | $ 8.49 |
| lares used in basic per share Iculations | 280 | 280 | 280 | 281 |
'
str(elements[13].text)
'Three Months Ended Twelve Months Ended July 31, July 31, July 31, July 31, (In millions) 2024 2023 2024 2023 Cost of revenue $ 102 $ 83 $ 402 $ 374 Selling and marketing 137 119 506 Research and development 161 148 639 General and administrative 94 98 368 Restructuring 25 — 25 Total share-based compensation expense $ 519 $ 448 $ 1,940 $'
The table content (mainly in the text form) is missing some rows or cells and does not match the actual table.
The top header row, 'Three Months Ended', is completely missing from the HTML table element.
The first few characters of each row label are missing in the HTML table (for example, 'st revenue:' and 'erest expense').
The text content does not match the HTML content at all; the text version is missing one column entirely.
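One way to quantify the mismatch described above is to compare the numeric values that appear in each form of the same element. The following is a minimal sketch using only the Python standard library; `CellCollector`, `numeric_tokens`, and `compare_table_forms` are hypothetical helper names, not part of unstructured, and the number pattern is an assumption tuned to the figures in this report.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td>/<th> cell in an HTML table string."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


# Assumed pattern: integers, thousands separators, decimals, optional parentheses.
NUMBER = re.compile(r"\(?\d+(?:,\d{3})*(?:\.\d+)?\)?")


def numeric_tokens(text):
    """Return every numeric token found in a string, in order."""
    return NUMBER.findall(text)


def compare_table_forms(text, html):
    """Report numeric values present in one form of the table but not the other."""
    parser = CellCollector()
    parser.feed(html)
    html_numbers = set(numeric_tokens(" ".join(parser.cells)))
    text_numbers = set(numeric_tokens(text))
    return {
        "only_in_text": sorted(text_numbers - html_numbers),
        "only_in_html": sorted(html_numbers - text_numbers),
    }
```

Calling `compare_table_forms(str(elements[13].text), str(elements[13].metadata.text_as_html))` on a table element would then list the values that the two representations disagree on. An empty result for both keys does not prove the table is complete, only that the two forms contain the same numbers.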
Expected behavior
There should be no difference between the text and HTML versions of chunks that contain table elements. It is not clear which one is more reliable for embedding.
The table header is missing from the table element. Also, the description of the table falls into the previous element, so the table loses its context.
In some cases a table is also split between chunks, so the partial table loses the context of the parent table.
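Until the partitioner keeps this context itself, a downstream workaround is possible. The sketch below is an assumption about how a RAG pipeline might compensate, not the library's behavior: it prepends the tail of the nearest preceding non-table chunk to every table chunk before embedding. `Chunk`, `is_table`, and `attach_table_context` are hypothetical names; in practice the flag would be derived from the element type returned by the partitioner.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    is_table: bool  # hypothetical flag, derived from the element type upstream


def attach_table_context(chunks, max_context_chars=300):
    """Build embedding strings, giving each table chunk its preceding description.

    Consecutive table chunks (a table split across chunks) all receive the
    same preceding text, so a continuation keeps the parent table's context.
    """
    embedding_texts = []
    last_non_table_text = ""
    for chunk in chunks:
        if chunk.is_table and last_non_table_text:
            context = last_non_table_text[-max_context_chars:]
            embedding_texts.append(context + "\n" + chunk.text)
        else:
            embedding_texts.append(chunk.text)
        if not chunk.is_table:
            last_non_table_text = chunk.text
    return embedding_texts
```

This only restores the surrounding description. It cannot recover header rows or cells that were dropped during extraction.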
Screenshots
If applicable, add screenshots to help explain your problem.
To Reproduce
Here is the Python code used:
from unstructured.partition.pdf import partition_pdf

# filename is the local path to the PDF linked below
elements = partition_pdf(
    filename=filename,
    # Unstructured helpers
    chunking_strategy="by_title",
    strategy="hi_res",
    infer_table_structure=True,
    max_partition=3000,
    max_characters=3000,
    # overlap_all=True
)
I used this file: https://investors.intuit.com/_assets/_a21b3f6dd5cf08cb659458f26330acce/intuit/news/2024-08-22_Intuit_Reports_Strong_Fourth_Quarter_and_Full_1202.pdf
Environment Info
PyTorch version: 2.5.1+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: Quadro T1000 with Max-Q Design
Nvidia driver version: 538.18
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=2310
DeviceID=CPU0
Family=198
L2CacheSize=1536
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2712
Name=Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
ProcessorType=3
Revision=
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.1
[pip3] onnxruntime==1.19.2
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[conda] Could not collect
Additional context
Add any other context about the problem here.