Parsing eval: compute layout metrics #3550

jacopo-chevallard · 2025-01-22T09:03:16Z

Note that the ground truth is based on the image PDFs, most of which have "abandon" areas, i.e. some special graphics in headers and footers have been removed from the image PDFs. This means that when comparing the layouts extracted from the native PDFs against the ground truth, we should expect some differences in the header/footer areas.

Starting from the ground-truth layout, which, for each PDF page, looks like

{
  "extra": {
    "relation": [
      {
        "relation_type": "parent_son",
        "source_anno_id": 2,
        "target_anno_id": 3
      },
      {
        "relation_type": "parent_son",
        "source_anno_id": 5,
        "target_anno_id": 8
      }
    ]
  },
  "layout_dets": [
    {
      "anno_id": 6,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            109.3333333333331,
            121.73651418039208,
            722.1022134807848,
            121.73651418039208,
            722.1022134807848,
            195.75809149176507,
            109.3333333333331,
            195.75809149176507
          ],
          "text": "国资背景基金情况"
        }
      ],
      "order": 1,
      "poly": [
        102.5999912116609,
        120.87255879760278,
        719.3118659856144,
        120.87255879760278,
        719.3118659856144,
        194.14083813380114,
        102.5999912116609,
        194.14083813380114
      ],
      "text": "国资背景基金情况"
    },
    {
      "anno_id": 4,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            99.66504579139392,
            227.6650457913944,
            1269.333333333333,
            227.6650457913944,
            1269.333333333333,
            271.3365750838786,
            99.66504579139392,
            271.3365750838786
          ],
          "text": "2022年备案基金规模小幅回升，但仍未恢复至资管新规出台前的水平"
        }
      ],
      "order": 2,
      "poly": [
        97.71487020898245,
        226.92028692633914,
        1271.9932332148471,
        226.92028692633914,
        1271.9932332148471,
        264.88925750697814,
        97.71487020898245,
        264.88925750697814
      ],
      "text": "2022年备案基金规模小幅回升，但仍未恢复至资管新规出台前的水平"
    },
    {
      "anno_id": 3,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            253.94664201855937,
            321.21295194692755,
            1076.1203813864063,
            321.21295194692755,
            1076.1203813864063,
            364.93470762745034,
            253.94664201855937,
            364.93470762745034
          ],
          "text": "2014年-2023Q3国资背景基金的备案数量及规模"
        }
      ],
      "order": 3,
      "poly": [
        246.96994018554688,
        318.7444152832031,
        1088.26025390625,
        318.7444152832031,
        1088.26025390625,
        369.0964660644531,
        246.96994018554688,
        369.0964660644531
      ],
      "text": "2014年-2023Q3国资背景基金的备案数量及规模"
    },
    {
      "anno_id": 2,
      "category_type": "figure",
      "ignore": false,
      "order": 4,
      "poly": [
        118.08102792118407,
        379.29373168945347,
        1299.4279383691976,
        379.29373168945347,
        1299.4279383691976,
        1028.2773128579047,
        118.08102792118407,
        1028.2773128579047
      ]
    },
    {
      "anno_id": 8,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "figure_caption",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            1509.6758069519938,
            324.34247361866034,
            2292.4771492866826,
            324.34247361866034,
            2292.4771492866826,
            364.8196229053426,
            1509.6758069519938,
            364.8196229053426
          ],
          "text": "2014年-2023Q3国资背景基金数量TOP10地区"
        }
      ],
      "order": 5,
      "poly": [
        1497.726318359375,
        318.7418518066406,
        2301.80224609375,
        318.7418518066406,
        2301.80224609375,
        367.1272888183594,
        1497.726318359375,
        367.1272888183594
      ],
      "text": "2014年-2023Q3国资背景基金数量TOP10地区"
    },
    {
      "anno_id": 5,
      "category_type": "figure",
      "ignore": false,
      "order": 6,
      "poly": [
        1370.0374839590943,
        424.35013794251097,
        2552.3561471143494,
        424.35013794251097,
        2552.3561471143494,
        1026.8955618700252,
        1370.0374839590943,
        1026.8955618700252
      ]
    },
    {
      "anno_id": 9,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "title",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            169.67751098302242,
            1071.225836994341,
            328.08580770628134,
            1071.225836994341,
            328.08580770628134,
            1111.655822350311,
            169.67751098302242,
            1111.655822350311
          ],
          "text": "核心发现"
        }
      ],
      "order": 7,
      "poly": [
        170.92340081387997,
        1069.7956822171332,
        326.21460986860313,
        1069.7956822171332,
        326.21460986860313,
        1111.7494049722532,
        170.92340081387997,
        1111.7494049722532
      ],
      "text": "核心发现"
    },
    {
      "anno_id": 7,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            165.603649650326,
            1150.009124125815,
            2509.333333333333,
            1150.009124125815,
            2509.333333333333,
            1198.666666666666,
            165.603649650326,
            1198.666666666666
          ],
          "text": "- 2018年4月资管新规出台后，国资背景基金备案数量增速放缓且规模骤减，受新冠疫情影响，2021年新增基金规模再次下降，虽然"
        },
        {
          "category_type": "text_span",
          "poly": [
            219.22996126565647,
            1201.1457902508969,
            2250.770752144285,
            1201.1457902508969,
            2250.770752144285,
            1243.9433217869077,
            219.22996126565647,
            1243.9433217869077
          ],
          "text": "2022年基金规模回升至1.25万亿元，但仍未恢复至资管新规出台前的水平，2023前三季度新增规模略低于2022年同期。"
        }
      ],
      "order": 8,
      "poly": [
        172.66793877059249,
        1155.2640660519091,
        2514.2408071863138,
        1155.2640660519091,
        2514.2408071863138,
        1241.6284871157177,
        172.66793877059249,
        1241.6284871157177
      ],
      "text": "- 2018年4月资管新规出台后，国资背景基金备案数量增速放缓且规模骤减，受新冠疫情影响，2021年新增基金规模再次下降，虽然 2022年基金规模回升至1.25万亿元，但仍未恢复至资管新规出台前的水平，2023前三季度新增规模略低于2022年同期。"
    },
    {
      "anno_id": 1,
      "attribute": {
        "text_background": "white",
        "text_language": "text_simplified_chinese",
        "text_rotate": "normal"
      },
      "category_type": "text_block",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            161.7899369148969,
            1278.308761376868,
            2508,
            1278.308761376868,
            2508,
            1317.333333333333,
            161.7899369148969,
            1317.333333333333
          ],
          "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只，基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省，广东"
        },
        {
          "category_type": "text_span",
          "poly": [
            222.66666666666688,
            1325.3333333333335,
            1623.8331583485456,
            1325.3333333333335,
            1623.8331583485456,
            1365.333333333333,
            222.66666666666688,
            1365.333333333333
          ],
          "text": "省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            1624.4165959289367,
            1327.0154193159506,
            1703.7259660435407,
            1327.0154193159506,
            1703.7259660435407,
            1363.1237504250385,
            1624.4165959289367,
            1363.1237504250385
          ],
          "text": "73%"
        },
        {
          "category_type": "text_span",
          "poly": [
            1704.6905743174548,
            1322.6134268787764,
            2053.985160092844,
            1322.6134268787764,
            2053.985160092844,
            1370.6736155849724,
            1704.6905743174548,
            1370.6736155849724
          ],
          "text": "，规模占全国总量的"
        },
        {
          "category_type": "equation_ignore",
          "poly": [
            2055.1374027302004,
            1326.3706276890023,
            2149.276980264608,
            1326.3706276890023,
            2149.276980264608,
            1365.7029169328305,
            2055.1374027302004,
            1365.7029169328305
          ],
          "text": "68%。"
        }
      ],
      "order": 9,
      "poly": [
        171.69999831539863,
        1278.820932742719,
        2512.084408886781,
        1278.820932742719,
        2512.084408886781,
        1365.690053585406,
        171.69999831539863,
        1365.690053585406
      ],
      "text": "- 截至2023Q3全国国资背景基金备案数量累计9196只，基金规模累计8.91万亿元。基金注册区域集中于广东省、浙江省和江苏省，广东省国资背景基金总规模遥遥领先。备案基金数量前10的省份基金数量占全国总量的 73% ，规模占全国总量的 68%。"
    },
    {
      "anno_id": 10,
      "category_type": "abandon",
      "ignore": false,
      "order": null,
      "poly": [
        114.12910090860571,
        1403.1676953230935,
        175.21358196554792,
        1403.1676953230935,
        175.21358196554792,
        1462.6586681785502,
        114.12910090860571,
        1462.6586681785502
      ]
    },
    {
      "anno_id": 0,
      "attribute": {
        "text_background": "white",
        "text_language": "text_en_ch_mixed",
        "text_rotate": "normal"
      },
      "category_type": "footer",
      "ignore": false,
      "line_with_spans": [
        {
          "category_type": "text_span",
          "poly": [
            178.18192276049803,
            1409.8767302579377,
            288.0868232114207,
            1409.8767302579377,
            288.0868232114207,
            1467.2607048296584,
            178.18192276049803,
            1467.2607048296584
          ],
          "text": "CVINFO 投中信息"
        }
      ],
      "order": null,
      "poly": [
        180.18207532211585,
        1404.2778174322868,
        289.9793827860912,
        1404.2778174322868,
        289.9793827860912,
        1462.652231000048,
        180.18207532211585,
        1462.652231000048
      ],
      "text": "CVINFO 投中信息"
    }
  ],
  "page_info": {
    "height": 1500,
    "image_path": "eastmoney_59cde7e939acc3124df9d3f2c85b5a0ec41b9da1157d5be38e098672022b47cb.pdf_11.jpg",
    "page_attribute": {
      "data_source": "PPT2PDF",
      "language": "simplified_chinese",
      "layout": "1andmore_column",
      "special_issue": [
        "watermark"
      ]
    },
    "page_no": 11,
    "width": 2667
  }
}

We consider the layout_dets array and, for each entry, extract the category_type together with its poly, which encodes the position as the (x, y) coordinates of the top-left, top-right, bottom-right and bottom-left corners of the bounding box, in that order.
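As a minimal sketch of this extraction step, assuming each page has already been loaded into a Python dict with the structure shown above: the helper names poly_to_bbox and extract_blocks are hypothetical, and collapsing the four corners into an axis-aligned (x_min, y_min, x_max, y_max) box is a simplification that only holds for non-rotated boxes.

```python
from typing import Any

# Axis-aligned box: (x_min, y_min, x_max, y_max), in image pixels.
BBox = tuple[float, float, float, float]


def poly_to_bbox(poly: list[float]) -> BBox:
    """Collapse the 8-value corner list [x1, y1, ..., x4, y4] into a box."""
    xs, ys = poly[0::2], poly[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def extract_blocks(page: dict[str, Any]) -> list[tuple[str, BBox]]:
    """Return (category_type, bbox) for every entry of layout_dets."""
    return [
        (det["category_type"], poly_to_bbox(det["poly"]))
        for det in page["layout_dets"]
    ]
```

The coordinates stay in the pixel space of the page image (page_info gives width and height), so boxes coming from a native PDF would probably need to be rescaled to the same space before any comparison.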

category_type can be one of:

# Block level annotation boxes
'title',              # Title
'text_block',         # Paragraph level plain text
'figure',             # Figure type
'figure_caption',     # Figure description/title
'figure_footnote',    # Figure notes
'table',              # Table body
'table_caption',      # Table description/title
'table_footnote',     # Table notes
'equation_isolated',  # Display formula
'equation_caption',   # Formula number
'header',             # Header
'footer',             # Footer
'page_number',        # Page number
'page_footnote',      # Page notes
'abandon',            # Other discarded content (e.g. irrelevant information in middle of page)
'code_txt',           # Code block
'code_txt_caption',   # Code block description
'reference'           # References

We extract the same information from the Megaparse output, i.e. for each page we extract the element type and the bounding box. We then group the pages by document type, language and layout type, and compute:

- the fraction of correctly extracted blocks in each block category. A block is correctly extracted if
  - the normalized category matches the ground truth (we need to establish the correspondence between the Megaparse categories and those above)
  - AND the bounding box coordinates match within some tolerance. We can set the tolerance to 1 pixel initially and refine (increase/decrease) it after the first tests.
- the average fraction of correctly extracted blocks, i.e. the average of the per-category fractions (so each block category contributes equally to the metric)
- the fraction of correctly extracted blocks across all categories (more numerous blocks, likely text blocks, contribute more to the metric)
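The three metrics could be sketched roughly as follows. This assumes both the ground truth and the Megaparse output have already been reduced to lists of (category, bbox) pairs with categories normalized to the ground-truth names; the function names, the greedy one-to-one matching and the "macro"/"micro" labels are illustrative choices, not something the issue prescribes.

```python
from collections import Counter

BBox = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Block = tuple[str, BBox]                  # (normalized category, bbox)


def boxes_match(a: BBox, b: BBox, tol: float = 1.0) -> bool:
    """True if every coordinate differs by at most tol pixels."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def layout_metrics(truth: list[Block], pred: list[Block], tol: float = 1.0) -> dict:
    """Per-category, category-averaged and overall fractions of matched blocks."""
    total = Counter(cat for cat, _ in truth)
    correct: Counter = Counter()
    unused = list(pred)  # predicted blocks not yet paired with a truth block
    for cat, box in truth:
        for i, (pcat, pbox) in enumerate(unused):
            if pcat == cat and boxes_match(box, pbox, tol):
                correct[cat] += 1
                del unused[i]  # each predicted block can match only once
                break
    per_category = {cat: correct[cat] / n for cat, n in total.items()}
    # Average over categories: every category weighs the same.
    macro = sum(per_category.values()) / len(per_category) if per_category else 0.0
    # Fraction over all blocks: frequent categories weigh more.
    micro = sum(correct.values()) / sum(total.values()) if total else 0.0
    return {"per_category": per_category, "macro": macro, "micro": micro}
```

Greedy matching is the simplest option and should be adequate at a 1-pixel tolerance; with a looser tolerance, overlapping candidates might call for an IoU-based or optimal assignment instead.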

We can also compute the metrics above across all document types.
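One possible way to do the grouping is to key pages on the page_attribute fields of the ground truth. In this sketch the helper group_pages is hypothetical, and mapping "document type" to the data_source field is an assumption; passing an empty key tuple pools every page into a single group, which gives the metrics across all document types.

```python
from collections import defaultdict
from typing import Any


def group_pages(
    pages: list[dict[str, Any]],
    keys: tuple[str, ...] = ("data_source", "language", "layout"),
) -> dict[tuple, list[dict[str, Any]]]:
    """Group ground-truth pages by the chosen page_attribute fields."""
    groups: dict[tuple, list[dict[str, Any]]] = defaultdict(list)
    for page in pages:
        attrs = page["page_info"]["page_attribute"]
        # Missing attributes become None rather than raising.
        groups[tuple(attrs.get(k) for k in keys)].append(page)
    return dict(groups)
```

The metrics can then be computed once per group, for example by concatenating the blocks of all pages that share a key.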


linear · 2025-01-22T09:03:17Z

CORE-331 Implement layout metrics

jacopo-chevallard self-assigned this Jan 22, 2025

jacopo-chevallard changed the title ~~Implement layout metrics~~ Parsing: layout metrics Jan 22, 2025

jacopo-chevallard changed the title ~~Parsing: layout metrics~~ Parsing: compute layout metrics Jan 28, 2025

jacopo-chevallard changed the title ~~Parsing: compute layout metrics~~ Parsing eval: compute layout metrics Jan 28, 2025
