GitHub - taineleau/MONSTERMASH: [IWCP 24] MONSTERMASH: Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting

MONSTERMASH

Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting

The data spans the 10th to the 20th century. Writing styles include, but are not limited to, Naskh (in varied styles), Nasta'liq, Maghribi, Sudani, Ta'liq, and Bihari.

For example, an image in the dataset comes with the following record:

{
  "pid": 26,
  "title": "Hartford Seminary Arabic MSS 199/Jawāhiral-nuṣūṣ fī ḥall kalimāt al-Fuṣūṣ",
  "time": "18th century",
  "language": "Arabic",
  "style": "Naskh",
  "note": "Complex title page with multiple ownership notes and other annotations",
  "polygons": [(x0, y0), (x1, y1), ...]
}

Usage

The dataset is under data/. We use labelme to annotate the dataset.
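Since the annotations are made with labelme, a minimal sketch for reading them back might look like the following. It assumes the files under data/ use labelme's standard JSON layout (a top-level "shapes" list whose entries carry "label", "shape_type", and "points"); the function names are illustrative and not part of this repository.

```python
import json
from pathlib import Path


def load_labelme_polygons(json_path):
    """Return (label, points) pairs for every polygon shape in one labelme file."""
    with open(json_path, encoding="utf-8") as f:
        annotation = json.load(f)
    polygons = []
    for shape in annotation.get("shapes", []):
        # labelme also stores rectangles, circles, etc.; keep polygons only.
        if shape.get("shape_type", "polygon") != "polygon":
            continue
        points = [(float(x), float(y)) for x, y in shape["points"]]
        polygons.append((shape.get("label", ""), points))
    return polygons


def load_dataset(data_dir="data"):
    """Map each annotation file name to its list of polygons (assumed layout)."""
    return {
        path.name: load_labelme_polygons(path)
        for path in sorted(Path(data_dir).rglob("*.json"))
    }
```

Calling load_dataset("data") would then give one entry per annotation file. If the files in this repository deviate from the standard labelme layout, the key names above need adjusting.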

Three baseline models

kraken

Install kraken from https://github.com/mittagessen/kraken. We use version 4.3 in our project.
Download the model:

wget https://github.com/OpenITI/arabic_script_ocr_models/raw/main/ms_mellon_print_layout.mlmodel

See the notebook models/kraken_pipe.ipynb for the inference code.
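For orientation, a rough sketch of line segmentation with kraken's Python API is given below. It is not the code from the notebook. It assumes kraken 4.x, where blla.segment returns a dict with a "lines" list whose entries hold "baseline" and "boundary" point lists; newer kraken releases return a different object. The function names segment_page and boundaries_from_segmentation are hypothetical.

```python
def segment_page(image_path, model_path="ms_mellon_print_layout.mlmodel"):
    """Run kraken's baseline segmenter on one page image (assumes kraken 4.x)."""
    # Imported lazily so the helper below works without kraken installed.
    from PIL import Image
    from kraken import blla
    from kraken.lib import vgsl

    model = vgsl.TorchVGSLModel.load_model(model_path)
    with Image.open(image_path) as im:
        return blla.segment(im, model=model)


def boundaries_from_segmentation(segmentation):
    """Extract one boundary polygon per detected line.

    Assumes the kraken 4.x dict layout:
    {"lines": [{"baseline": [...], "boundary": [...]}, ...]}.
    """
    polygons = []
    for line in segmentation.get("lines", []):
        boundary = line.get("boundary")
        if boundary:
            polygons.append([(int(x), int(y)) for x, y in boundary])
    return polygons
```

The polygons returned by boundaries_from_segmentation(segment_page("page.jpg")) can then be compared against the annotated polygons; "page.jpg" is a placeholder file name.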

TESTR

Installation instructions are available at https://github.com/mlpc-ucsd/TESTR.

doc-ufcn

Installation instructions are available at https://gitlab.teklia.com/dla/doc-ufcn (you may need to register before you can access their GitLab).

Checkpoints are available on Hugging Face: https://huggingface.co/Teklia/doc-ufcn-generic-historical-line

Citation

@inproceedings{chen2024monstermash,
  title={{MONSTERMASH}: Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting},
  author={Chen, Danlu and Murel, Jacob and Taimoor Shahid and Xiang Zhang and Jonathan Parkes Allen and Taylor Berg-Kirkpatrick and David A. Smith},
  booktitle={International Conference on Document Analysis and Recognition, IWCP workshop},
  year={2024},
  organization={Springer}
}
