Note that the ground truth is based on the image PDFs, most of which have abandon areas, i.e. some special graphics in the headers and footers have been removed from the image PDFs. This means that when we compare the layouts extracted from the native PDFs against the ground truth, we should expect some differences in the header/footer areas.
Starting from the ground-truth layout, which, for each PDF page, looks like
We consider the array layout_dets and, for each entry, extract the category_type together with its corresponding poly, which encodes the position information: (x, y) coordinates for the top-left, top-right, bottom-right, and bottom-left corners of the bounding box.
category_type can be one of:
# Block level annotation boxes
'title', # Title
'text_block', # Paragraph level plain text
'figure', # Figure type
'figure_caption', # Figure description/title
'figure_footnote', # Figure notes
'table', # Table body
'table_caption', # Table description/title
'table_footnote', # Table notes
'equation_isolated', # Display formula
'equation_caption', # Formula number
'header', # Header
'footer', # Footer
'page_number', # Page number
'page_footnote', # Page notes
'abandon', # Other discarded content (e.g. irrelevant information in middle of page)
'code_txt', # Code block
'code_txt_caption', # Code block description
'reference', # References
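The extraction of category_type and poly from layout_dets could be sketched roughly as follows. This is only an illustration: it assumes that each page is a dict, that poly is a flat list of eight numbers (x, y for the four corners in the order given above), and the helper name extract_blocks is hypothetical.

```python
from typing import Any

# (category_type, (x_min, y_min, x_max, y_max))
Block = tuple[str, tuple[float, float, float, float]]


def extract_blocks(page: dict[str, Any]) -> list[Block]:
    """Collect the category and axis-aligned bounding box of every layout_dets entry.

    Assumption: poly is a flat list [x1, y1, x2, y2, x3, y3, x4, y4].
    """
    blocks: list[Block] = []
    for det in page.get("layout_dets", []):
        poly = det["poly"]
        xs, ys = poly[0::2], poly[1::2]  # even indices are x, odd indices are y
        blocks.append((det["category_type"], (min(xs), min(ys), max(xs), max(ys))))
    return blocks


# Made-up page with a single title block, for illustration only.
example_page = {
    "layout_dets": [
        {"category_type": "title", "poly": [10, 20, 110, 20, 110, 40, 10, 40]},
    ]
}
print(extract_blocks(example_page))
```

Reducing the four corners to a min/max box keeps the later comparison simple, at the cost of ignoring any rotation of the polygon.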
We extract the same information from the Megaparse output, i.e. for each page we extract the element type and the bounding box. We then group the pages per document type, per language, and per layout type, and compute:
the fraction of correctly extracted blocks in each block category. A block is correctly extracted if
the normalized category matches the ground truth (we need to establish the correspondence between the Megaparse categories and those above)
AND the bounding box coordinates match within some tolerance. We can set the tolerance to 1 pixel initially and refine it (increase/decrease) after the first tests.
the average fraction of correctly extracted blocks, i.e. the average of the per-category fractions (this means that each block category contributes equally to the metric)
the fraction of correctly extracted blocks across all categories (more numerous blocks, likely text blocks, contribute more to the metric)
We can also compute the metrics above across all document types.
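A minimal sketch of the three metrics for one group of pages is given below. It assumes that blocks are already reduced to (category, (x_min, y_min, x_max, y_max)) tuples, that the Megaparse categories have already been mapped onto the ground-truth names, and that fractions are taken relative to the number of ground-truth blocks. The function names boxes_match and score are hypothetical, and the greedy one-to-one matching is one possible choice, not a settled design.

```python
from collections import Counter

Box = tuple[float, float, float, float]
Block = tuple[str, Box]


def boxes_match(a: Box, b: Box, tol: float = 1.0) -> bool:
    """True if every coordinate differs by at most tol pixels."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def score(gt_blocks: list[Block], pred_blocks: list[Block], tol: float = 1.0):
    """Return (per-category fractions, their unweighted average, overall fraction)."""
    total = Counter(cat for cat, _ in gt_blocks)
    correct: Counter = Counter()
    unused = list(pred_blocks)  # each predicted block may match at most once
    for cat, box in gt_blocks:
        for i, (pred_cat, pred_box) in enumerate(unused):
            if pred_cat == cat and boxes_match(box, pred_box, tol):
                correct[cat] += 1
                del unused[i]
                break
    per_category = {cat: correct[cat] / n for cat, n in total.items()}
    # Each category contributes equally to the average.
    macro = sum(per_category.values()) / len(per_category) if per_category else 0.0
    # Every block contributes equally, so frequent categories dominate.
    micro = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_category, macro, micro


# Made-up blocks: the second text_block is off by 5 pixels and does not match.
gt = [("title", (0, 0, 10, 10)), ("text_block", (0, 20, 10, 30)), ("text_block", (0, 40, 10, 50))]
pred = [("title", (0.5, 0, 10, 10)), ("text_block", (0, 20, 10, 30)), ("text_block", (0, 45, 10, 50))]
per_category, macro, micro = score(gt, pred)
print(per_category, macro, micro)
```

Running score once per group (document type, language, layout type) and once over all pages gives both the grouped and the global figures.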