[Feature] add XFUND dataset and project LayoutLMv3 #1809

Status: Open — wants to merge 58 commits into `dev-1.x` (changes shown from 52 of 58 commits)

Commits
24119ec
Work-in-progress commit
KevinNuNu Mar 23, 2023
38fbba5
Merge branch 'dev-1.x' into layoutlm
KevinNuNu Mar 23, 2023
3ae3f84
Refactor the XFUND dataset config file structure
KevinNuNu Mar 25, 2023
35d0dd9
Add the XFUND zh dataset
KevinNuNu Mar 25, 2023
125cce2
[Fix] Fix files generated by JsonDumper not displaying Chinese characters correctly
KevinNuNu Mar 25, 2023
f4f1dac
[Fix] Fix a path-joining bug
KevinNuNu Mar 25, 2023
ffe8909
Merge branch 'dev-1.x' into layoutlm
KevinNuNu Mar 25, 2023
5b203ad
Add config files for the other 6 datasets
KevinNuNu Mar 25, 2023
2016921
Add the XFUND RE task
KevinNuNu Mar 25, 2023
717ac03
pre-commit fix
KevinNuNu Mar 25, 2023
078cc83
[Fix] Simplify the XFUND parser and improve the final dataset directory structure
KevinNuNu Mar 27, 2023
d3e16ad
[Fix] Revert and remove the HuggingFace dataset format (not useful); update the metainfo of the SER/RE packers; partially add S…
KevinNuNu Mar 28, 2023
1d0c5e3
Partially complete SERDataset data loading
KevinNuNu Mar 28, 2023
deb96cc
Improve the SER/RE packers: decide whether to include an entry based on whether the words key exists
KevinNuNu Mar 30, 2023
443e979
Improve the naming of the XFUND dataset config_generator to make the config_generator directory structure clearer
KevinNuNu Mar 30, 2023
a88a129
Rename SERDataset to XFUNDSERDataset
KevinNuNu Mar 30, 2023
25f084a
ser/re packer docstring fix
KevinNuNu Mar 30, 2023
f8f2614
add SERDataSample structure and PackSERInputs transforms
KevinNuNu Mar 30, 2023
c8a7b68
Build the initial model file structure for the SER part; LayoutLMv3DataPreprocessor parameters are aligned with HuggingFace's LayoutLM…
KevinNuNu Mar 30, 2023
81a4527
Merge branch 'dev-1.x' into layoutlm
KevinNuNu Apr 10, 2023
e22e466
Remove id2label info from the packer metainfo
KevinNuNu Apr 11, 2023
ceb66dc
Improve xfund_dataset
KevinNuNu Apr 11, 2023
2eb79c3
Make the added metainfo types explicit
KevinNuNu Apr 11, 2023
a6bbe12
Simplified LayoutLMv3 code
KevinNuNu Apr 17, 2023
7951200
Improve the LayoutLMv3 preprocessing code by integrating it into datasets/transforms for clarity
KevinNuNu Apr 17, 2023
3ddf780
Add test scripts
KevinNuNu Apr 17, 2023
60b2a52
Merge branch 'dev-1.x' into layoutlm
KevinNuNu Apr 17, 2023
84be264
Refactor the XFUND dataset into MMOCR format
KevinNuNu Apr 18, 2023
2767fcc
Simplify XFUNDDataset so it no longer distinguishes between SER/RE tasks
KevinNuNu Apr 19, 2023
4b4b343
Move all preprocessing originally done inside XFUNDDataset into the pipeline; refactor the preprocessing code into LoadProcessorFromPretrain…
KevinNuNu Apr 19, 2023
8399f94
Update the project test scripts
KevinNuNu Apr 19, 2023
bda6742
Get the train.py training workflow running end to end
KevinNuNu Apr 19, 2023
44c68b1
Change the SERDataSample format
KevinNuNu Apr 19, 2023
023b0cf
Fix a naming error in SERPostprocessor
KevinNuNu Apr 19, 2023
de98eb1
Merge branch 'dev-1.x' into layoutlm
KevinNuNu Apr 28, 2023
a05a2e1
Organize the config directory
KevinNuNu Apr 28, 2023
3664773
Add an evaluation module for the SER task
KevinNuNu Apr 28, 2023
6c1f5be
Improve PackSERInputs
KevinNuNu Apr 29, 2023
d21a181
Move the data-processing code into the project directory
KevinNuNu May 1, 2023
40cfe65
fix an error
KevinNuNu May 1, 2023
d1f43e7
Move ser_data_sample into projects
KevinNuNu May 1, 2023
e102ef2
Merge branch 'dev-1.x' into layoutlm
KevinNuNu May 8, 2023
50fa7f9
Standardize the XFUND dataset preparation scripts
KevinNuNu May 8, 2023
a04cd51
[Fix] Fix a bug during inference
KevinNuNu May 8, 2023
81b8f86
Use custom_imports to streamline importing custom modules
KevinNuNu May 8, 2023
059e203
Improve the visualization of SER task results
KevinNuNu May 8, 2023
f0a03ac
Standardize config file naming
KevinNuNu May 25, 2023
d9a3a5e
Simplify: replace the previous default_collate-based long_text_data_collate with a clearer, easier-to-understand ser_collate
KevinNuNu May 25, 2023
b04e126
Fix bugs in ser_postprocessor and ser_visualizer for the case where gt_label is absent during inference
KevinNuNu May 25, 2023
b6f55f8
Improve ser_postprocessor
KevinNuNu May 29, 2023
edf7fe8
[Fix] Fix a bug where a tokenization result of exactly 510*n tokens has no end marker after stripping the leading/trailing None markers, so the last label cannot be added to the results
KevinNuNu Jun 12, 2023
8a1e37b
[Fix] Reset word_biolabels to prevent duplicate additions
KevinNuNu Jun 12, 2023
dbe9145
Merge branch 'dev-1.x' into layoutlm
KevinNuNu Jun 26, 2023
0f0f8ca
Remove all absolute paths from the project; fill in README.md
KevinNuNu Jun 27, 2023
ae8c426
fix lint
gaotongxiao Oct 18, 2023
ab14bb0
Merge branch 'ci' into layoutlm
gaotongxiao Oct 20, 2023
db5673f
fix ci
gaotongxiao Oct 20, 2023
c7a3895
ci
gaotongxiao Oct 28, 2023
2 changes: 1 addition & 1 deletion .codespellrc
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
skip = *.ipynb
count =
quiet-level = 3
ignore-words-list = convertor,convertors,formating,nin,wan,datas,hist,ned
ignore-words-list = convertor,convertors,formating,nin,wan,datas,hist,ned,ser
14 changes: 14 additions & 0 deletions configs/re/_base_/datasets/xfund_zh.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
xfund_zh_re_data_root = 'data/xfund/zh'

xfund_zh_re_train = dict(
type='XFUNDDataset',
data_root=xfund_zh_re_data_root,
ann_file='re_train.json',
pipeline=None)

xfund_zh_re_test = dict(
type='XFUNDDataset',
data_root=xfund_zh_re_data_root,
ann_file='re_test.json',
test_mode=True,
pipeline=None)
14 changes: 14 additions & 0 deletions configs/ser/_base_/datasets/xfund_zh.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
xfund_zh_ser_data_root = 'data/xfund/zh'

xfund_zh_ser_train = dict(
type='XFUNDDataset',
data_root=xfund_zh_ser_data_root,
ann_file='ser_train.json',
pipeline=None)

xfund_zh_ser_test = dict(
type='XFUNDDataset',
data_root=xfund_zh_ser_data_root,
ann_file='ser_test.json',
test_mode=True,
pipeline=None)
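The two base configs above (for RE and SER) deliberately leave `pipeline=None`; a downstream config is expected to fill in the pipeline and wrap the dataset in a dataloader. A minimal sketch of that wiring, assuming MMEngine-style config dicts — the pipeline transform and dataloader fields here are illustrative assumptions, not part of this PR:

```python
# Sketch of how a downstream config might complete the base dataset entry.
# The pipeline contents and dataloader fields are illustrative assumptions.
xfund_zh_ser_data_root = 'data/xfund/zh'

xfund_zh_ser_train = dict(
    type='XFUNDDataset',
    data_root=xfund_zh_ser_data_root,
    ann_file='ser_train.json',
    pipeline=None)

# A concrete config would override the placeholder pipeline ...
train_pipeline = [
    dict(type='LoadImageFromFile'),
]
xfund_zh_ser_train['pipeline'] = train_pipeline

# ... and wrap the dataset in a dataloader dict (MMEngine convention).
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    dataset=xfund_zh_ser_train)
```

Keeping the base configs pipeline-free lets SER and RE experiments share the same dataset definitions while supplying task-specific transforms.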
41 changes: 41 additions & 0 deletions dataset_zoo/xfund/de/metafile.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
Name: 'XFUND'
Paper:
Title: 'XFUND: A Benchmark Dataset for Multilingual Visually Rich Form Understanding'
URL: https://aclanthology.org/2022.findings-acl.253
Venue: ACL
Year: '2022'
BibTeX: '@inproceedings{xu-etal-2022-xfund,
title = "{XFUND}: A Benchmark Dataset for Multilingual Visually Rich Form Understanding",
author = "Xu, Yiheng and
Lv, Tengchao and
Cui, Lei and
Wang, Guoxin and
Lu, Yijuan and
Florencio, Dinei and
Zhang, Cha and
Wei, Furu",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-acl.253",
doi = "10.18653/v1/2022.findings-acl.253",
pages = "3214--3224",
abstract = "Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. However, the existed research work has focused only on the English domain while neglecting the importance of multilingual generalization. In this paper, we introduce a human-annotated multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Meanwhile, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually rich document understanding. Experimental results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The XFUND dataset and the pre-trained LayoutXLM model have been publicly available at https://aka.ms/layoutxlm.",
}'
Data:
Website: https://github.com/doc-analysis/XFUND
Language:
- Chinese, Japanese, Spanish, French, Italian, German, Portuguese
Scene:
- Document
Granularity:
- Word
Tasks:
- ser
- re
License:
Type: CC BY 4.0
Link: https://creativecommons.org/licenses/by/4.0/
Format: .json
6 changes: 6 additions & 0 deletions dataset_zoo/xfund/de/re.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
_base_ = ['ser.py']

_base_.train_preparer.packer.type = 'REPacker'
_base_.test_preparer.packer.type = 'REPacker'

config_generator = dict(type='XFUNDREConfigGenerator')
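The re.py file above inherits everything from ser.py and overrides only the packer type and config generator, using dotted-attribute assignment on `_base_`. A rough Python emulation of that merge semantics — this mimics MMEngine-style config inheritance and is not the real implementation:

```python
import copy

# Rough emulation of the `_base_` override in re.py: start from the
# ser.py config and swap only the packer types and the generator.
ser_cfg = {
    'train_preparer': {'packer': {'type': 'SERPacker'}},
    'test_preparer': {'packer': {'type': 'SERPacker'}},
    'config_generator': {'type': 'XFUNDSERConfigGenerator'},
}

re_cfg = copy.deepcopy(ser_cfg)
# Equivalent of `_base_.train_preparer.packer.type = 'REPacker'`
re_cfg['train_preparer']['packer']['type'] = 'REPacker'
re_cfg['test_preparer']['packer']['type'] = 'REPacker'
# Top-level names in the child file replace those from the base.
re_cfg['config_generator'] = {'type': 'XFUNDREConfigGenerator'}
```

Because only the packer differs, the download URLs, md5 checksums, and gatherer/parser settings stay in one place (ser.py) for both tasks.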
70 changes: 70 additions & 0 deletions dataset_zoo/xfund/de/sample_anno.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
**Semantic Entity Recognition / Relation Extraction**

```json
{
"lang": "zh",
"version": "0.1",
"split": "val",
"documents": [
{
"id": "zh_val_0",
"uid": "0ac15750a098682aa02b51555f7c49ff43adc0436c325548ba8dba560cde4e7e",
"document": [
{
"box": [
410,
541,
535,
590
],
"text": "夏艳辰",
"label": "answer",
"words": [
{
"box": [
413,
541,
447,
587
],
"text": "夏"
},
{
"box": [
458,
542,
489,
588
],
"text": "艳"
},
{
"box": [
497,
544,
531,
590
],
"text": "辰"
}
],
"linking": [
[
30,
26
]
],
"id": 26
},
// ...
],
"img": {
"fname": "zh_val_0.jpg",
"width": 2480,
"height": 3508
}
},
// ...
]
}
```
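To make the schema above concrete: each entry in `document` is one entity with a box, text, label, per-word boxes, and `linking` pairs of entity ids. A short sketch that walks one such entry, using the sample data from the annotation above — the traversal logic is illustrative, not code from this PR:

```python
# Minimal walk over one XFUND-style document entry, matching the
# annotation schema shown above.
doc = {
    "id": "zh_val_0",
    "document": [
        {
            "box": [410, 541, 535, 590],
            "text": "夏艳辰",
            "label": "answer",
            "words": [
                {"box": [413, 541, 447, 587], "text": "夏"},
                {"box": [458, 542, 489, 588], "text": "艳"},
                {"box": [497, 544, 531, 590], "text": "辰"},
            ],
            "linking": [[30, 26]],  # pairs of entity ids (question, answer)
            "id": 26,
        },
    ],
}

# SER uses the per-entity text/label; RE uses the linking pairs.
entities = {e["id"]: (e["text"], e["label"]) for e in doc["document"]}
links = [pair for e in doc["document"] for pair in e["linking"]]
```

The presence or absence of the `words` key is what the SER/RE packers in this PR check when deciding whether to include an entry.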
60 changes: 60 additions & 0 deletions dataset_zoo/xfund/de/ser.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
lang = 'de'
data_root = f'data/xfund/{lang}'
cache_path = 'data/cache'

train_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.train.zip',
save_name=f'{lang}_train.zip',
md5='8c9f949952d227290e22f736cdbe4d29',
content=['image'],
mapping=[[f'{lang}_train/*.jpg', 'imgs/train']]),
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.train.json',
save_name=f'{lang}_train.json',
md5='3e4b95c7da893bf5a91018445c83ccdd',
content=['annotation'],
mapping=[[f'{lang}_train.json', 'annotations/train.json']])
]),
gatherer=dict(
type='MonoGatherer', ann_name='train.json', img_dir='imgs/train'),
parser=dict(type='XFUNDAnnParser'),
packer=dict(type='SERPacker'),
dumper=dict(type='JsonDumper'),
)

test_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.val.zip',
save_name=f'{lang}_val.zip',
md5='d13d12278d585214183c3cfb949b0e59',
content=['image'],
mapping=[[f'{lang}_val/*.jpg', 'imgs/test']]),
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.val.json',
save_name=f'{lang}_val.json',
md5='8eaf742f2d19b17f5c0e72da5c7761ef',
content=['annotation'],
mapping=[[f'{lang}_val.json', 'annotations/test.json']])
]),
gatherer=dict(
type='MonoGatherer', ann_name='test.json', img_dir='imgs/test'),
parser=dict(type='XFUNDAnnParser'),
packer=dict(type='SERPacker'),
dumper=dict(type='JsonDumper'),
)

delete = ['annotations'] + [f'{lang}_{split}' for split in ['train', 'val']]
config_generator = dict(type='XFUNDSERConfigGenerator')
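The preparer config above wires five stages that the dataset preparer runs in sequence: obtain the archives, gather images and annotations, parse, pack, and dump. A toy sketch of that orchestration, with the config dicts standing in for the real NaiveDataObtainer/MonoGatherer/XFUNDAnnParser/SERPacker/JsonDumper classes — the stage order matches the config keys, but the runner itself is an illustration:

```python
# Toy orchestration of the preparer stages configured above. A real
# runner would build each class from its config and call it; here we
# only record the stage types in execution order.
def run_preparer(preparer_cfg):
    order = ['obtainer', 'gatherer', 'parser', 'packer', 'dumper']
    trace = []
    for stage in order:
        cfg = preparer_cfg.get(stage)
        if cfg is not None:
            trace.append(cfg['type'])
    return trace

train_preparer = dict(
    obtainer=dict(type='NaiveDataObtainer'),
    gatherer=dict(type='MonoGatherer'),
    parser=dict(type='XFUNDAnnParser'),
    packer=dict(type='SERPacker'),
    dumper=dict(type='JsonDumper'),
)
```

The `delete` list at the end of ser.py then removes the intermediate `annotations/` and extracted archive directories once the packed JSON files have been dumped.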
41 changes: 41 additions & 0 deletions dataset_zoo/xfund/es/metafile.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
Name: 'XFUND'
Paper:
Title: 'XFUND: A Benchmark Dataset for Multilingual Visually Rich Form Understanding'
URL: https://aclanthology.org/2022.findings-acl.253
Venue: ACL
Year: '2022'
BibTeX: '@inproceedings{xu-etal-2022-xfund,
title = "{XFUND}: A Benchmark Dataset for Multilingual Visually Rich Form Understanding",
author = "Xu, Yiheng and
Lv, Tengchao and
Cui, Lei and
Wang, Guoxin and
Lu, Yijuan and
Florencio, Dinei and
Zhang, Cha and
Wei, Furu",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-acl.253",
doi = "10.18653/v1/2022.findings-acl.253",
pages = "3214--3224",
abstract = "Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. However, the existed research work has focused only on the English domain while neglecting the importance of multilingual generalization. In this paper, we introduce a human-annotated multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Meanwhile, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually rich document understanding. Experimental results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The XFUND dataset and the pre-trained LayoutXLM model have been publicly available at https://aka.ms/layoutxlm.",
}'
Data:
Website: https://github.com/doc-analysis/XFUND
Language:
- Chinese, Japanese, Spanish, French, Italian, German, Portuguese
Scene:
- Document
Granularity:
- Word
Tasks:
- ser
- re
License:
Type: CC BY 4.0
Link: https://creativecommons.org/licenses/by/4.0/
Format: .json
6 changes: 6 additions & 0 deletions dataset_zoo/xfund/es/re.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
_base_ = ['ser.py']

_base_.train_preparer.packer.type = 'REPacker'
_base_.test_preparer.packer.type = 'REPacker'

config_generator = dict(type='XFUNDREConfigGenerator')
70 changes: 70 additions & 0 deletions dataset_zoo/xfund/es/sample_anno.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
**Semantic Entity Recognition / Relation Extraction**

```json
{
"lang": "zh",
"version": "0.1",
"split": "val",
"documents": [
{
"id": "zh_val_0",
"uid": "0ac15750a098682aa02b51555f7c49ff43adc0436c325548ba8dba560cde4e7e",
"document": [
{
"box": [
410,
541,
535,
590
],
"text": "夏艳辰",
"label": "answer",
"words": [
{
"box": [
413,
541,
447,
587
],
"text": "夏"
},
{
"box": [
458,
542,
489,
588
],
"text": "艳"
},
{
"box": [
497,
544,
531,
590
],
"text": "辰"
}
],
"linking": [
[
30,
26
]
],
"id": 26
},
// ...
],
"img": {
"fname": "zh_val_0.jpg",
"width": 2480,
"height": 3508
}
},
// ...
]
}
```
60 changes: 60 additions & 0 deletions dataset_zoo/xfund/es/ser.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
lang = 'es'
data_root = f'data/xfund/{lang}'
cache_path = 'data/cache'

train_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.train.zip',
save_name=f'{lang}_train.zip',
md5='0ff89032bc6cb2e7ccba062c71944d03',
content=['image'],
mapping=[[f'{lang}_train/*.jpg', 'imgs/train']]),
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.train.json',
save_name=f'{lang}_train.json',
md5='b40b43f276c7deaaaa5923d035da2820',
content=['annotation'],
mapping=[[f'{lang}_train.json', 'annotations/train.json']])
]),
gatherer=dict(
type='MonoGatherer', ann_name='train.json', img_dir='imgs/train'),
parser=dict(type='XFUNDAnnParser'),
packer=dict(type='SERPacker'),
dumper=dict(type='JsonDumper'),
)

test_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.val.zip',
save_name=f'{lang}_val.zip',
md5='efad9fb11ee3036bef003b6364a79ac0',
content=['image'],
mapping=[[f'{lang}_val/*.jpg', 'imgs/test']]),
dict(
url='https://github.com/doc-analysis/XFUND/'
f'releases/download/v1.0/{lang}.val.json',
save_name=f'{lang}_val.json',
md5='96ffc2057049ba2826a005825b3e7f0d',
content=['annotation'],
mapping=[[f'{lang}_val.json', 'annotations/test.json']])
]),
gatherer=dict(
type='MonoGatherer', ann_name='test.json', img_dir='imgs/test'),
parser=dict(type='XFUNDAnnParser'),
packer=dict(type='SERPacker'),
dumper=dict(type='JsonDumper'),
)

delete = ['annotations'] + [f'{lang}_{split}' for split in ['train', 'val']]
config_generator = dict(type='XFUNDSERConfigGenerator')