Update test GT

Signed-off-by: Christoph Auer <[email protected]>
DS4SD · Dec 17, 2024 · 8243325
1 parent 6c8c625
commit 8243325

Showing 35 changed files with 95 additions and 76 deletions.
diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt b/tests/data/groundtruth/docling_v1/2203.01017v2.doctags.txt
@@ -207,11 +207,11 @@
 <caption>Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.</caption>
 </figure>
 <figure>
-<location><page_8><loc_35><loc_44><loc_61><loc_52></location>
+<location><page_8><loc_63><loc_44><loc_89><loc_52></location>
 </figure>
 <caption><location><page_8><loc_10><loc_41><loc_87><loc_42></location>Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.</caption>
 <figure>
-<location><page_8><loc_63><loc_44><loc_89><loc_52></location>
+<location><page_8><loc_35><loc_44><loc_61><loc_52></location>
 <caption>Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.</caption>
 </figure>
 <subtitle-level-1><location><page_8><loc_8><loc_37><loc_27><loc_38></location>5.5. Qualitative Analysis</subtitle-level-1>
@@ -285,24 +285,24 @@
 <paragraph><location><page_12><loc_10><loc_71><loc_47><loc_73></location>- · TableFormer output does not include the table cell content.</paragraph>
 <paragraph><location><page_12><loc_10><loc_67><loc_47><loc_69></location>- · There are occasional inaccuracies in the predictions of the bounding boxes.</paragraph>
 <paragraph><location><page_12><loc_8><loc_50><loc_47><loc_65></location>However, it is possible to mitigate those limitations by combining the TableFormer predictions with the information already present inside a programmatic PDF document. More specifically, PDF documents can be seen as a sequence of PDF cells where each cell is described by its content and bounding box. If we are able to associate the PDF cells with the predicted table cells, we can directly link the PDF cell content to the table cell structure and use the PDF bounding boxes to correct misalignments in the predicted table cell bounding boxes.</paragraph>
-<paragraph><location><page_12><loc_50><loc_51><loc_89><loc_64></location>- 7. Generate a new set of pair-wise matches between the corrected bounding boxes and PDF cells. This time use a modified version of the IOU metric, where the area of the intersection between the predicted and PDF cells is divided by the PDF cell area. In case there are multiple matches for the same PDF cell, the prediction with the higher score is preferred. This covers the cases where the PDF cells are smaller than the area of predicted or corrected prediction cells.</paragraph>
 <paragraph><location><page_12><loc_8><loc_47><loc_47><loc_50></location>Here is a step-by-step description of the prediction postprocessing:</paragraph>
 <paragraph><location><page_12><loc_8><loc_42><loc_47><loc_47></location>- 1. Get the minimal grid dimensions - number of rows and columns for the predicted table structure. This represents the most granular grid for the underlying table structure.</paragraph>
 <paragraph><location><page_12><loc_8><loc_36><loc_47><loc_42></location>- 2. Generate pair-wise matches between the bounding boxes of the PDF cells and the predicted cells. The Intersection Over Union (IOU) metric is used to evaluate the quality of the matches.</paragraph>
 <paragraph><location><page_12><loc_8><loc_33><loc_47><loc_36></location>- 3. Use a carefully selected IOU threshold to designate the matches as "good" ones and "bad" ones.</paragraph>
 <paragraph><location><page_12><loc_8><loc_29><loc_47><loc_33></location>- 3.a. If all IOU scores in a column are below the threshold, discard all predictions (structure and bounding boxes) for that column.</paragraph>
 <paragraph><location><page_12><loc_8><loc_24><loc_47><loc_28></location>- 4. Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:</paragraph>
-<paragraph><location><page_12><loc_50><loc_24><loc_89><loc_28></location>9a. Compute the top and bottom boundary of the horizontal band for each grid row (min/max y coordinates per row).</paragraph>
 <paragraph><location><page_12><loc_8><loc_13><loc_47><loc_16></location>where c is one of { left, centroid, right } and x$_{c}$ is the xcoordinate for the corresponding point.</paragraph>
-<paragraph><location><page_12><loc_50><loc_13><loc_89><loc_16></location>- 9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.</paragraph>
 <paragraph><location><page_12><loc_8><loc_10><loc_47><loc_13></location>- 5. Use the alignment computed in step 4, to compute the median x -coordinate for all table columns and the me-</paragraph>
-<paragraph><location><page_12><loc_50><loc_10><loc_89><loc_13></location>- 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-</paragraph>
-<paragraph><location><page_12><loc_50><loc_21><loc_89><loc_23></location>- 9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.</paragraph>
-<paragraph><location><page_12><loc_50><loc_16><loc_89><loc_20></location>- 9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).</paragraph>
-<paragraph><location><page_12><loc_50><loc_42><loc_89><loc_51></location>- 8. In some rare occasions, we have noticed that TableFormer can confuse a single column as two. When the postprocessing steps are applied, this results with two predicted columns pointing to the same PDF column. In such case we must de-duplicate the columns according to highest total column intersection score.</paragraph>
-<paragraph><location><page_12><loc_50><loc_28><loc_89><loc_41></location>- 9. Pick up the remaining orphan cells. There could be cases, when after applying all the previous post-processing steps, some PDF cells could still remain without any match to predicted cells. However, it is still possible to deduce the correct matching for an orphan PDF cell by mapping its bounding box on the geometry of the grid. This mapping decides if the content of the orphan cell will be appended to an already matched table cell, or a new table cell should be created to match with the orphan.</paragraph>
 <paragraph><location><page_12><loc_50><loc_68><loc_89><loc_73></location>dian cell size for all table cells. The usage of median during the computations, helps to eliminate outliers caused by occasional column spans which are usually wider than the normal.</paragraph>
 <paragraph><location><page_12><loc_50><loc_65><loc_89><loc_67></location>- 6. Snap all cells with bad IOU to their corresponding median x -coordinates and cell sizes.</paragraph>
+<paragraph><location><page_12><loc_50><loc_51><loc_89><loc_64></location>- 7. Generate a new set of pair-wise matches between the corrected bounding boxes and PDF cells. This time use a modified version of the IOU metric, where the area of the intersection between the predicted and PDF cells is divided by the PDF cell area. In case there are multiple matches for the same PDF cell, the prediction with the higher score is preferred. This covers the cases where the PDF cells are smaller than the area of predicted or corrected prediction cells.</paragraph>
+<paragraph><location><page_12><loc_50><loc_42><loc_89><loc_51></location>- 8. In some rare occasions, we have noticed that TableFormer can confuse a single column as two. When the postprocessing steps are applied, this results with two predicted columns pointing to the same PDF column. In such case we must de-duplicate the columns according to highest total column intersection score.</paragraph>
+<paragraph><location><page_12><loc_50><loc_28><loc_89><loc_41></location>- 9. Pick up the remaining orphan cells. There could be cases, when after applying all the previous post-processing steps, some PDF cells could still remain without any match to predicted cells. However, it is still possible to deduce the correct matching for an orphan PDF cell by mapping its bounding box on the geometry of the grid. This mapping decides if the content of the orphan cell will be appended to an already matched table cell, or a new table cell should be created to match with the orphan.</paragraph>
+<paragraph><location><page_12><loc_50><loc_24><loc_89><loc_28></location>9a. Compute the top and bottom boundary of the horizontal band for each grid row (min/max y coordinates per row).</paragraph>
+<paragraph><location><page_12><loc_50><loc_21><loc_89><loc_23></location>- 9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.</paragraph>
+<paragraph><location><page_12><loc_50><loc_16><loc_89><loc_20></location>- 9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).</paragraph>
+<paragraph><location><page_12><loc_50><loc_13><loc_89><loc_16></location>- 9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.</paragraph>
+<paragraph><location><page_12><loc_50><loc_10><loc_89><loc_13></location>- 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-</paragraph>
 <paragraph><location><page_13><loc_8><loc_89><loc_15><loc_91></location>phan cell.</paragraph>
 <paragraph><location><page_13><loc_8><loc_86><loc_47><loc_89></location>9f. Otherwise create a new structural cell and match it wit the orphan cell.</paragraph>
 <paragraph><location><page_13><loc_8><loc_83><loc_47><loc_86></location>Aditional images with examples of TableFormer predictions and post-processing can be found below.</paragraph>

diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.json b/tests/data/groundtruth/docling_v1/2203.01017v2.json
diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.md b/tests/data/groundtruth/docling_v1/2203.01017v2.md
@@ -409,8 +409,6 @@ Figure 7: Distribution of the tables across different dimensions per dataset. Si
 
 However, it is possible to mitigate those limitations by combining the TableFormer predictions with the information already present inside a programmatic PDF document. More specifically, PDF documents can be seen as a sequence of PDF cells where each cell is described by its content and bounding box. If we are able to associate the PDF cells with the predicted table cells, we can directly link the PDF cell content to the table cell structure and use the PDF bounding boxes to correct misalignments in the predicted table cell bounding boxes.
 
-- 7. Generate a new set of pair-wise matches between the corrected bounding boxes and PDF cells. This time use a modified version of the IOU metric, where the area of the intersection between the predicted and PDF cells is divided by the PDF cell area. In case there are multiple matches for the same PDF cell, the prediction with the higher score is preferred. This covers the cases where the PDF cells are smaller than the area of predicted or corrected prediction cells.
-
 Here is a step-by-step description of the prediction postprocessing:
 
 - 1. Get the minimal grid dimensions - number of rows and columns for the predicted table structure. This represents the most granular grid for the underlying table structure.
@@ -423,27 +421,29 @@ Here is a step-by-step description of the prediction postprocessing:
 
 - 4. Find the best-fitting content alignment for the predicted cells with good IOU per each column. The alignment of the column can be identified by the following formula:
 
-9a. Compute the top and bottom boundary of the horizontal band for each grid row (min/max y coordinates per row).
-
 where c is one of { left, centroid, right } and x$_{c}$ is the xcoordinate for the corresponding point.
 
-- 9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.
-
 - 5. Use the alignment computed in step 4, to compute the median x -coordinate for all table columns and the me-
 
-- 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-
+dian cell size for all table cells. The usage of median during the computations, helps to eliminate outliers caused by occasional column spans which are usually wider than the normal.
 
-- 9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.
+- 6. Snap all cells with bad IOU to their corresponding median x -coordinates and cell sizes.
 
-- 9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).
+- 7. Generate a new set of pair-wise matches between the corrected bounding boxes and PDF cells. This time use a modified version of the IOU metric, where the area of the intersection between the predicted and PDF cells is divided by the PDF cell area. In case there are multiple matches for the same PDF cell, the prediction with the higher score is preferred. This covers the cases where the PDF cells are smaller than the area of predicted or corrected prediction cells.
 
 - 8. In some rare occasions, we have noticed that TableFormer can confuse a single column as two. When the postprocessing steps are applied, this results with two predicted columns pointing to the same PDF column. In such case we must de-duplicate the columns according to highest total column intersection score.
 
 - 9. Pick up the remaining orphan cells. There could be cases, when after applying all the previous post-processing steps, some PDF cells could still remain without any match to predicted cells. However, it is still possible to deduce the correct matching for an orphan PDF cell by mapping its bounding box on the geometry of the grid. This mapping decides if the content of the orphan cell will be appended to an already matched table cell, or a new table cell should be created to match with the orphan.
 
-dian cell size for all table cells. The usage of median during the computations, helps to eliminate outliers caused by occasional column spans which are usually wider than the normal.
+9a. Compute the top and bottom boundary of the horizontal band for each grid row (min/max y coordinates per row).
 
-- 6. Snap all cells with bad IOU to their corresponding median x -coordinates and cell sizes.
+- 9b. Intersect the orphan's bounding box with the row bands, and map the cell to the closest grid row.
+
+- 9c. Compute the left and right boundary of the vertical band for each grid column (min/max x coordinates per column).
+
+- 9d. Intersect the orphan's bounding box with the column bands, and map the cell to the closest grid column.
+
+- 9e. If the table cell under the identified row and column is not empty, extend its content with the content of the or-
 
 phan cell.
 

diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.pages.json b/tests/data/groundtruth/docling_v1/2203.01017v2.pages.json
diff --git a/tests/data/groundtruth/docling_v1/2206.01062.json b/tests/data/groundtruth/docling_v1/2206.01062.json
diff --git a/tests/data/groundtruth/docling_v1/2206.01062.pages.json b/tests/data/groundtruth/docling_v1/2206.01062.pages.json
diff --git a/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.json b/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.json
diff --git a/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.pages.json b/tests/data/groundtruth/docling_v1/2305.03393v1-pg9.pages.json
diff --git a/tests/data/groundtruth/docling_v1/2305.03393v1.doctags.txt b/tests/data/groundtruth/docling_v1/2305.03393v1.doctags.txt
@@ -1,6 +1,7 @@
 <document>
 <subtitle-level-1><location><page_1><loc_22><loc_82><loc_79><loc_85></location>Optimized Table Tokenization for Table Structure Recognition</subtitle-level-1>
-<paragraph><location><page_1><loc_23><loc_74><loc_78><loc_79></location>Maksym Lysak [0000 − 0002 − 3723 − $^{6960]}$, Ahmed Nassar[0000 − 0002 − 9468 − $^{0822]}$, Nikolaos Livathinos [0000 − 0001 − 8513 − $^{3491]}$, Christoph Auer[0000 − 0001 − 5761 − $^{0422]}$, and Peter Staar [0000 − 0002 − 8088 − 0823]</paragraph>
+<paragraph><location><page_1><loc_23><loc_75><loc_78><loc_79></location>Maksym Lysak [0000 − 0002 − 3723 − $^{6960]}$, Ahmed Nassar[0000 − 0002 − 9468 − $^{0822]}$, Nikolaos Livathinos [0000 − 0001 − 8513 − $^{3491]}$, Christoph Auer[0000 − 0001 − 5761 − $^{0422]}$, [0000 − 0002 − 8088 − 0823]</paragraph>
+<paragraph><location><page_1><loc_38><loc_74><loc_49><loc_75></location>and Peter Staar</paragraph>
 <paragraph><location><page_1><loc_46><loc_72><loc_55><loc_73></location>IBM Research</paragraph>
 <paragraph><location><page_1><loc_36><loc_70><loc_64><loc_71></location>{mly,ahn,nli,cau,taa}@zurich.ibm.com</paragraph>
 <paragraph><location><page_1><loc_27><loc_41><loc_74><loc_66></location>Abstract. Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs. Popular table structure data-sets will be published in OTSL format to the community.</paragraph>

diff --git a/tests/data/groundtruth/docling_v1/2305.03393v1.json b/tests/data/groundtruth/docling_v1/2305.03393v1.json
diff --git a/tests/data/groundtruth/docling_v1/2305.03393v1.md b/tests/data/groundtruth/docling_v1/2305.03393v1.md
@@ -1,6 +1,8 @@
 ## Optimized Table Tokenization for Table Structure Recognition
 
-Maksym Lysak [0000 − 0002 − 3723 − $^{6960]}$, Ahmed Nassar[0000 − 0002 − 9468 − $^{0822]}$, Nikolaos Livathinos [0000 − 0001 − 8513 − $^{3491]}$, Christoph Auer[0000 − 0001 − 5761 − $^{0422]}$, and Peter Staar [0000 − 0002 − 8088 − 0823]
+Maksym Lysak [0000 − 0002 − 3723 − $^{6960]}$, Ahmed Nassar[0000 − 0002 − 9468 − $^{0822]}$, Nikolaos Livathinos [0000 − 0001 − 8513 − $^{3491]}$, Christoph Auer[0000 − 0001 − 5761 − $^{0422]}$, [0000 − 0002 − 8088 − 0823]
+
+and Peter Staar
 
 IBM Research
 

diff --git a/tests/data/groundtruth/docling_v1/2305.03393v1.pages.json b/tests/data/groundtruth/docling_v1/2305.03393v1.pages.json