Commit 2bcffa7 — "project init" (1 parent: d795efd)

15 files changed: +2085 −1 lines

.gitattributes

Lines changed: 1 addition & 0 deletions

CompHRDoc.zip filter=lfs diff=lfs merge=lfs -text

CompHRDoc.zip

Lines changed: 3 additions & 0 deletions

version https://git-lfs.github.com/spec/v1
oid sha256:530f482b75523a80fe1b0a7480fd8273c44f9239e0189650a4841c0aae61d03d
size 129857097

README.md

Lines changed: 90 additions & 1 deletion

# CompHRDoc

Comp-HRDoc is the first comprehensive benchmark specifically designed for hierarchical document structure analysis. It encompasses page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction. Comp-HRDoc is built upon the [HRDoc-Hard dataset](https://github.com/jfma-USTC/HRDoc), which comprises 1,000 documents for training and 500 for testing. We retain all original images without modification and extend the original annotations to support the evaluation of these tasks.
## News

- **We released the annotations of the Comp-HRDoc benchmark; please refer to [`CompHRDoc.zip`](./CompHRDoc.zip).**
- **We released the evaluation tool of the Comp-HRDoc benchmark; please refer to the [`evaluation`](evaluation/) folder.**
- **We released the original paper, [Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis](https://arxiv.org/pdf/2401.11874.pdf), on arXiv.**
## Introduction

Document Structure Analysis (DSA) is a comprehensive process that identifies the fundamental components within a document, including headings, paragraphs, lists, tables, and figures, and then establishes the logical relationships and structures among these components. The result is a structured representation of the document's physical layout that accurately mirrors its logical structure, improving the effectiveness and accessibility of information retrieval and processing.

In the contemporary digital landscape, most mainstream documents are structured creations, authored with hierarchical-schema software such as LaTeX, Microsoft Word, and HTML. Consequently, Hierarchical Document Structure Analysis (HDSA), which focuses on extracting and reconstructing the inherent hierarchical structures of these documents, has gained significant attention. Previous datasets primarily target specific sub-tasks of DSA, such as Page Object Detection, Reading Order Prediction, and Table of Contents (TOC) Extraction. Despite substantial progress on these individual sub-tasks, the research community still lacks a comprehensive end-to-end system or benchmark that addresses all aspects of document structure analysis concurrently. Leveraging the HRDoc dataset, we establish a comprehensive benchmark, Comp-HRDoc, to evaluate page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction concurrently.

<img src="assets/example.png" height="500" alt="">
### Data Directory Structure

```plaintext
Comp-HRDoc/
├── HRDH_MSRA_POD_TRAIN/
│   ├── Images/              # put the document images of the HRDoc-Hard training set into this folder
│   │   ├── 1401.6399_0.png
│   │   ├── 1401.6399_1.png
│   │   └── ...
│   ├── hdsa_train.json
│   ├── coco_train.json
│   └── ...
└── HRDH_MSRA_POD_TEST/
    ├── Images/              # put the document images of the HRDoc-Hard test set into this folder
    │   ├── 1401.3699_0.png
    │   ├── 1401.3699_1.png
    │   └── ...
    ├── test_eval/           # hierarchical document structure for evaluation
    │   ├── 1401.3699.json
    │   ├── 1402.2741.json
    │   └── ...
    ├── test_eval_toc/       # table of contents structure for evaluation
    │   ├── 1401.3699.json
    │   ├── 1402.2741.json
    │   └── ...
    ├── hdsa_test.json
    ├── coco_test.json
    └── ...
```
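Before running experiments it is easy to sanity-check that the images and annotation files are in place. A minimal sketch, assuming the layout above; the `check_layout` helper and `EXPECTED` table are illustrative, not part of the release:

```python
import os

# Files/folders expected under each split, per the tree above (illustrative subset).
EXPECTED = {
    "HRDH_MSRA_POD_TRAIN": ["Images", "hdsa_train.json", "coco_train.json"],
    "HRDH_MSRA_POD_TEST": ["Images", "test_eval", "test_eval_toc",
                           "hdsa_test.json", "coco_test.json"],
}

def check_layout(root):
    """Return a list of expected paths that are missing under `root`."""
    missing = []
    for split, entries in EXPECTED.items():
        for entry in entries:
            path = os.path.join(root, split, entry)
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

An empty return value means the directory tree matches the expected layout; any missing paths (typically the `Images/` folders before the HRDoc-Hard download) are listed explicitly.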
**For a detailed explanation of each file and folder, please refer to [`datasets/Comp-HRDoc/HRDH_MSRA_POD_TRAIN/README.md`](datasets/Comp-HRDoc/HRDH_MSRA_POD_TRAIN/README.md) and [`datasets/Comp-HRDoc/HRDH_MSRA_POD_TEST/README.md`](datasets/Comp-HRDoc/HRDH_MSRA_POD_TEST/README.md).**

**Due to license restrictions, please download the images of HRDoc-Hard from the [HRDoc-Hard dataset](https://github.com/jfma-USTC/HRDoc) and put them into the corresponding folders.**
### Detect-Order-Construct

We propose a comprehensive, tree-construction-based approach to analyzing hierarchical document structures. The method decomposes tree construction into three distinct stages: Detect, Order, and Construct. Given a set of document images, the Detect stage identifies all page objects and assigns a logical role to each, forming the nodes of the hierarchical document structure tree. The Order stage then establishes the reading order relationships among these nodes, which corresponds to a pre-order traversal of the tree. Finally, the Construct stage identifies hierarchical relationships (e.g., a Table of Contents) between semantic units to build an abstract hierarchical document structure tree. By integrating the results of all three stages, we can construct a complete hierarchical document structure tree, facilitating a more comprehensive understanding of complex documents.

<img src="assets/pipeline.png">
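The three-stage decomposition can be sketched as a data-flow in code. This is illustrative only: in the paper each stage is a learned model, and every name and field below (`role`, `reading_index`, the stub logic) is invented for the sketch:

```python
def detect(pages):
    """Detect stage (stub): gather page objects, each already tagged with a
    logical role; in the real system this is a detection + classification model."""
    return [obj for page in pages for obj in page]

def order(nodes):
    """Order stage (stub): arrange nodes into reading order, i.e. the
    pre-order traversal of the target structure tree."""
    return sorted(nodes, key=lambda n: n["reading_index"])

def construct(ordered):
    """Construct stage (stub): link semantic units hierarchically, e.g.
    collect section headings into a table of contents."""
    toc = [n for n in ordered if n["role"] == "section"]
    return {"toc": toc, "body": ordered}

def build_tree(pages):
    # Integrating all three stages yields the full structure tree.
    return construct(order(detect(pages)))
```

The point of the sketch is the composition: each stage consumes the previous stage's output, so the final tree combines node roles (Detect), reading order (Order), and hierarchy (Construct).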
## Results

### Hierarchical Document Structure Reconstruction on HRDoc

<img src="assets/hrdoc_results.png">

### End-to-End Evaluation on Comp-HRDoc

<img src="assets/results.png">
## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [[email protected]](mailto:[email protected]) with any additional questions or comments.
## 📝 Citing

If you find this code useful, please consider citing our work.

```
@article{wang2024detect,
  title={Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis},
  author={Wang, Jiawei and Hu, Kai and Zhong, Zhuoyao and Sun, Lei and Huo, Qiang},
  journal={arXiv preprint arXiv:2401.11874},
  year={2024}
}
```

assets/example.png (845 KB)
assets/hrdoc_results.png (47.9 KB)
assets/pipeline.png (393 KB)
assets/results.png (67.4 KB)
Lines changed: 94 additions & 0 deletions
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
###
# Author: JeffreyMa
# -----
# Copyright (c) 2023 iFLYTEK & USTC
# -----
# HISTORY:
# Date       By  Comments
# ---------- --- ----------------------------------------------------------
###

import os
import json
import tqdm
import logging
import argparse
from sklearn.metrics import f1_score
from utils import trans_class

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)

class2id_dict = {
    "title": 0,
    "author": 1,
    "mail": 2,
    "affili": 3,
    "section": 4,
    "fstline": 5,
    "paraline": 6,
    "table": 7,
    "figure": 8,
    "caption": 9,
    "equation": 10,
    "footer": 11,
    "header": 12,
    "footnote": 13,
}

def class2id(jdata, unit):
    """Map a unit's class name to its integer id, normalizing unknown names via trans_class."""
    class_ = unit['class']
    if class_ not in class2id_dict:
        class_ = trans_class(jdata, unit)
    assert class_ in class2id_dict, "{} not in {} classes!".format(
        class_, len(class2id_dict))
    return class2id_dict[class_]

def assert_filetree(args):
    """Make sure `gt_folder` and `pred_folder` contain the same json files."""
    gt_files = set(os.listdir(args.gt_folder))
    pred_files = set(os.listdir(args.pred_folder))
    assert gt_files == pred_files, "{} and {} contain different PDF files!".format(
        args.gt_folder, args.pred_folder)

def main():
    parser = argparse.ArgumentParser()

    # Required parameters
    parser.add_argument("--gt_folder", type=str, required=True,
                        help="The folder storing ground-truth files.")
    parser.add_argument("--pred_folder", type=str, required=True,
                        help="The folder storing predicted results.")

    args = parser.parse_args()
    logger.info("Args received, gt_folder: {}, pred_folder: {}".format(args.gt_folder, args.pred_folder))

    assert_filetree(args=args)
    logger.info("File tree matched, start parsing json files!")

    gt_class = []
    pred_class = []

    for pdf_file in tqdm.tqdm(os.listdir(args.gt_folder)):
        gt_file = os.path.join(args.gt_folder, pdf_file)
        pred_file = os.path.join(args.pred_folder, pdf_file)
        with open(gt_file) as f:
            gt_json = json.load(f)
        with open(pred_file) as f:
            pred_json = json.load(f)
        assert len(gt_json) == len(pred_json), "{} and {} contain different numbers of units".format(
            gt_file, pred_file)
        gt_class.extend([class2id(gt_json, x) for x in gt_json])
        pred_class.extend([class2id(pred_json, x) for x in pred_json])
    logger.info("Parsing finished, got {} units in total. Start calculating f1!".format(len(gt_class)))

    detailed_f1 = f1_score(gt_class, pred_class, average=None)
    macro_f1 = f1_score(gt_class, pred_class, average='macro')
    micro_f1 = f1_score(gt_class, pred_class, average='micro')
    logger.info("detailed_f1 : {}, macro_f1 : {}, micro_f1 : {}".format(str(detailed_f1), macro_f1, micro_f1))

if __name__ == "__main__":
    main()
```
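The three averages the script reports (per-class, macro, micro F1) can be reproduced by hand. A small pure-Python sketch on made-up toy labels, matching what `f1_score` computes with `average=None`, `'macro'`, and `'micro'`:

```python
gt   = [0, 0, 1, 1, 2]   # toy ground-truth class ids
pred = [0, 1, 1, 1, 2]   # toy predictions (one class-0 unit mislabeled as 1)

def per_class_f1(gt, pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), as in average=None."""
    f1s = {}
    for c in sorted(set(gt) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gt, pred))
        fp = sum(g != c and p == c for g, p in zip(gt, pred))
        fn = sum(g == c and p != c for g, p in zip(gt, pred))
        f1s[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return f1s

f1s = per_class_f1(gt, pred)
# average='macro': unweighted mean of the per-class scores.
macro_f1 = sum(f1s.values()) / len(f1s)
# average='micro': for single-label multiclass this equals plain accuracy.
micro_f1 = sum(g == p for g, p in zip(gt, pred)) / len(gt)
```

The distinction matters here because the 14 logical roles are imbalanced (e.g. far more `paraline` units than `title` units), so macro F1 weights rare classes equally while micro F1 is dominated by frequent ones.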
