Commit 2bcffa7 — "project init" (1 parent: d795efd)

15 files changed: +2085 −1 lines

.gitattributes

Lines changed: 1 addition & 0 deletions

CompHRDoc.zip filter=lfs diff=lfs merge=lfs -text

CompHRDoc.zip

Lines changed: 3 additions & 0 deletions

version https://git-lfs.github.com/spec/v1
oid sha256:530f482b75523a80fe1b0a7480fd8273c44f9239e0189650a4841c0aae61d03d
size 129857097

README.md

Lines changed: 90 additions & 1 deletion

# CompHRDoc

Comp-HRDoc is the first comprehensive benchmark specifically designed for hierarchical document structure analysis. It encompasses page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction. Comp-HRDoc is built upon the [HRDoc-Hard dataset](https://github.com/jfma-USTC/HRDoc), which comprises 1,000 documents for training and 500 for testing. We retain all original images without modification and extend the original annotations to support the evaluation of these tasks.
## News

- **We released the annotations of the Comp-HRDoc benchmark; please refer to [`CompHRDoc.zip`](./CompHRDoc.zip).**
- **We released the evaluation tool of the Comp-HRDoc benchmark; please refer to the [`evaluation`](evaluation/) folder.**
- **We released the original paper, [Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis](https://arxiv.org/pdf/2401.11874.pdf), on arXiv.**
## Introduction

Document Structure Analysis (DSA) is a comprehensive process that identifies the fundamental components within a document, including headings, paragraphs, lists, tables, and figures, and then establishes the logical relationships and structures among these components. The result is a structured representation of the document's physical layout that accurately mirrors its logical structure, improving the effectiveness and accessibility of information retrieval and processing.

In the contemporary digital landscape, most mainstream documents are structured creations, authored with hierarchical-schema software such as LaTeX, Microsoft Word, and HTML. Consequently, Hierarchical Document Structure Analysis (HDSA), which focuses on extracting and reconstructing the inherent hierarchical structures of these documents, has gained significant attention. Previous datasets primarily target specific sub-tasks of DSA, such as Page Object Detection, Reading Order Prediction, and Table of Contents (TOC) Extraction. Despite substantial progress on these individual sub-tasks, the research community still lacks a comprehensive end-to-end system or benchmark that addresses all aspects of document structure analysis concurrently. Leveraging the HRDoc dataset, we establish a comprehensive benchmark, Comp-HRDoc, to evaluate page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction concurrently.

<img src="assets/example.png" height="500" alt="">
### Data Directory Structure

```plaintext
Comp-HRDoc/
├── HRDH_MSRA_POD_TRAIN/
│   ├── Images/              # put the document images of the HRDoc-Hard training set into this folder
│   │   ├── 1401.6399_0.png
│   │   ├── 1401.6399_1.png
│   │   └── ...
│   ├── hdsa_train.json
│   ├── coco_train.json
│   └── ...
└── HRDH_MSRA_POD_TEST/
    ├── Images/              # put the document images of the HRDoc-Hard test set into this folder
    │   ├── 1401.3699_0.png
    │   ├── 1401.3699_1.png
    │   └── ...
    ├── test_eval/           # hierarchical document structure for evaluation
    │   ├── 1401.3699.json
    │   ├── 1402.2741.json
    │   └── ...
    ├── test_eval_toc/       # table of contents structure for evaluation
    │   ├── 1401.3699.json
    │   ├── 1402.2741.json
    │   └── ...
    ├── hdsa_test.json
    ├── coco_test.json
    └── ...
```
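Before running experiments it is easy to sanity-check that the images and annotation files are in place. A minimal sketch, assuming the layout above; the `check_layout` helper and `EXPECTED` table are illustrative, not part of the release:

```python
import os

# Files/folders expected under each split, per the tree above (illustrative subset).
EXPECTED = {
    "HRDH_MSRA_POD_TRAIN": ["Images", "hdsa_train.json", "coco_train.json"],
    "HRDH_MSRA_POD_TEST": ["Images", "test_eval", "test_eval_toc",
                           "hdsa_test.json", "coco_test.json"],
}

def check_layout(root):
    """Return a list of expected paths that are missing under `root`."""
    missing = []
    for split, entries in EXPECTED.items():
        for entry in entries:
            path = os.path.join(root, split, entry)
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

An empty return value means the directory tree matches the expected layout; any missing paths (typically the `Images/` folders before the HRDoc-Hard download) are listed explicitly.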
**For a detailed explanation of each file and folder, please refer to [`datasets/Comp-HRDoc/HRDH_MSRA_POD_TRAIN/README.md`](datasets/Comp-HRDoc/HRDH_MSRA_POD_TRAIN/README.md) and [`datasets/Comp-HRDoc/HRDH_MSRA_POD_TEST/README.md`](datasets/Comp-HRDoc/HRDH_MSRA_POD_TEST/README.md).**

**Due to license restrictions, please download the images of HRDoc-Hard from the [HRDoc-Hard dataset](https://github.com/jfma-USTC/HRDoc) and put them into the corresponding folders.**
### Detect-Order-Construct

We propose a comprehensive, tree-construction-based approach to analyzing hierarchical document structures. The method decomposes tree construction into three distinct stages: Detect, Order, and Construct. Given a set of document images, the Detect stage identifies all page objects and assigns a logical role to each, forming the nodes of the hierarchical document structure tree. The Order stage then establishes the reading order relationships among these nodes, which corresponds to a pre-order traversal of the tree. Finally, the Construct stage identifies hierarchical relationships (e.g., a Table of Contents) between semantic units to build an abstract hierarchical document structure tree. By integrating the results of all three stages, we can construct a complete hierarchical document structure tree, facilitating a more comprehensive understanding of complex documents.

<img src="assets/pipeline.png">
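The three-stage decomposition can be sketched as a data-flow in code. This is illustrative only: in the paper each stage is a learned model, and every name and field below (`role`, `reading_index`, the stub logic) is invented for the sketch:

```python
def detect(pages):
    """Detect stage (stub): gather page objects, each already tagged with a
    logical role; in the real system this is a detection + classification model."""
    return [obj for page in pages for obj in page]

def order(nodes):
    """Order stage (stub): arrange nodes into reading order, i.e. the
    pre-order traversal of the target structure tree."""
    return sorted(nodes, key=lambda n: n["reading_index"])

def construct(ordered):
    """Construct stage (stub): link semantic units hierarchically, e.g.
    collect section headings into a table of contents."""
    toc = [n for n in ordered if n["role"] == "section"]
    return {"toc": toc, "body": ordered}

def build_tree(pages):
    # Integrating all three stages yields the full structure tree.
    return construct(order(detect(pages)))
```

The point of the sketch is the composition: each stage consumes the previous stage's output, so the final tree combines node roles (Detect), reading order (Order), and hierarchy (Construct).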
## Results

### Hierarchical Document Structure Reconstruction on HRDoc

<img src="assets/hrdoc_results.png">

### End-to-End Evaluation on Comp-HRDoc

<img src="assets/results.png">
## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [[email protected]](mailto:[email protected]) with any additional questions or comments.
## 📝 Citing

If you find this code useful, please consider citing our work.

```
@article{wang2024detect,
  title={Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis},
  author={Wang, Jiawei and Hu, Kai and Zhong, Zhuoyao and Sun, Lei and Huo, Qiang},
  journal={arXiv preprint arXiv:2401.11874},
  year={2024}
}
```

assets/example.png (845 KB)
assets/hrdoc_results.png (47.9 KB)
assets/pipeline.png (393 KB)
assets/results.png (67.4 KB)
Lines changed: 94 additions & 0 deletions
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
###
# Author: JeffreyMa
# -----
# Copyright (c) 2023 iFLYTEK & USTC
# -----
# HISTORY:
# Date       By  Comments
# ---------- --- ----------------------------------------------------------
###

import os
import json
import tqdm
import logging
import argparse
from sklearn.metrics import f1_score
from utils import trans_class

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)

class2id_dict = {
    "title": 0,
    "author": 1,
    "mail": 2,
    "affili": 3,
    "section": 4,
    "fstline": 5,
    "paraline": 6,
    "table": 7,
    "figure": 8,
    "caption": 9,
    "equation": 10,
    "footer": 11,
    "header": 12,
    "footnote": 13,
}

def class2id(jdata, unit):
    """Map a unit's class name to its integer id, normalizing unknown names via trans_class."""
    class_ = unit['class']
    if class_ not in class2id_dict:
        class_ = trans_class(jdata, unit)
    assert class_ in class2id_dict, "{} not in {} classes!".format(
        class_, len(class2id_dict))
    return class2id_dict[class_]

def assert_filetree(args):
    """Make sure `gt_folder` and `pred_folder` contain the same json files."""
    gt_files = set(os.listdir(args.gt_folder))
    pred_files = set(os.listdir(args.pred_folder))
    assert gt_files == pred_files, "{} and {} contain different PDF files!".format(
        args.gt_folder, args.pred_folder)

def main():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument("--gt_folder", type=str, required=True,
                        help="The folder storing ground-truth files.")
    parser.add_argument("--pred_folder", type=str, required=True,
                        help="The folder storing predicted results.")

    args = parser.parse_args()
    logger.info("Args received, gt_folder: {}, pred_folder: {}".format(args.gt_folder, args.pred_folder))

    assert_filetree(args=args)
    logger.info("File tree matched, start parsing json files!")

    gt_class = []
    pred_class = []

    for pdf_file in tqdm.tqdm(os.listdir(args.gt_folder)):
        gt_file = os.path.join(args.gt_folder, pdf_file)
        pred_file = os.path.join(args.pred_folder, pdf_file)
        with open(gt_file) as f:
            gt_json = json.load(f)
        with open(pred_file) as f:
            pred_json = json.load(f)
        assert len(gt_json) == len(pred_json), "{} and {} contain different numbers of units".format(
            gt_file, pred_file)
        gt_class.extend([class2id(gt_json, x) for x in gt_json])
        pred_class.extend([class2id(pred_json, x) for x in pred_json])
    logger.info("Parsing finished, got {} units in total. Start calculating f1!".format(len(gt_class)))

    detailed_f1 = f1_score(gt_class, pred_class, average=None)
    macro_f1 = f1_score(gt_class, pred_class, average='macro')
    micro_f1 = f1_score(gt_class, pred_class, average='micro')
    logger.info("detailed_f1 : {}, macro_f1 : {}, micro_f1 : {}".format(str(detailed_f1), macro_f1, micro_f1))

if __name__ == "__main__":
    main()
```
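The three averages the script reports (per-class, macro, micro F1) can be reproduced by hand. A small pure-Python sketch on made-up toy labels, matching what `f1_score` computes with `average=None`, `'macro'`, and `'micro'`:

```python
gt   = [0, 0, 1, 1, 2]   # toy ground-truth class ids
pred = [0, 1, 1, 1, 2]   # toy predictions (one class-0 unit mislabeled as 1)

def per_class_f1(gt, pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), as in average=None."""
    f1s = {}
    for c in sorted(set(gt) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gt, pred))
        fp = sum(g != c and p == c for g, p in zip(gt, pred))
        fn = sum(g == c and p != c for g, p in zip(gt, pred))
        f1s[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return f1s

f1s = per_class_f1(gt, pred)
# average='macro': unweighted mean of the per-class scores.
macro_f1 = sum(f1s.values()) / len(f1s)
# average='micro': for single-label multiclass this equals plain accuracy.
micro_f1 = sum(g == p for g, p in zip(gt, pred)) / len(gt)
```

The distinction matters here because the 14 logical roles are imbalanced (e.g. far more `paraline` units than `title` units), so macro F1 weights rare classes equally while micro F1 is dominated by frequent ones.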
