This is work in progress. The goal is to create an NLP ground truth corpus based on the OCR ground truth data for the historical newspaper Deutscher Reichsanzeiger und Preußischer Staatsanzeiger (1819-1945). It was scanned and OCR-ed at UB Mannheim.
- ✅ Convert the unprocessed text lines from Reichsanzeiger PAGE XML files to separate lines in TXT files [via blatt to_txt]. See data/text_raw/.
- ✅ Remove hyphens & line breaks from the text lines from Reichsanzeiger files and save them as plain text in TXT files [via blatt to_txt]. See data/text_unhyphenated/.
- ✅ Split plain text without line breaks & without hyphens into sentences & save it as one sentence per line TSV files [via blatt to_tsv]. See data/sentences_raw/.
- ✅ Correct sentence splitting manually and remove "noisy data" (e.g., tables). See data/sentences_checked/.
- ✅ Import plain text (one sentence per line) to INCEpTION
- ✅ Create the annotation guidelines
- ✅ Create a tagset and annotation layer in INCEpTION according to the annotation guidelines. See inception/tagsets/ and inception/layers.
- ✅ Annotate plain text according to the annotation guidelines
- ✅ Export the annotations in INCEpTION formats (e.g., UIMA CAS XMI). See data/.
- ✅ Create a converter from XMI to IOB format and convert XMI files into IOB files (created cas2iob)
- ⏳ Curate the annotations from two annotators
- 🔜 Train baseline models for NER/NEL
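The hyphen and line-break removal step in the list above can be illustrated with a minimal sketch. This is not the implementation used by blatt; the function name `unhyphenate`, the set of hyphen characters, and the lowercase heuristic are assumptions made for illustration only.

```python
# Characters treated as line-end hyphens (assumption: ASCII hyphen,
# double oblique hyphen and not sign, as they may occur in OCR of old prints).
LINE_END_HYPHENS = "-⸗¬"


def unhyphenate(lines):
    """Join text lines into one string and merge words split across line ends."""
    text = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not text:
            text = line
        elif text[-1] in LINE_END_HYPHENS:
            if line[0].islower():
                # Word continues on the next line: drop the hyphen and join.
                text = text[:-1] + line
            else:
                # Likely a real compound or a range: keep the hyphen, no space.
                text += line
        else:
            # Ordinary line break: replace it with a single space.
            text += " " + line
    return text


print(unhyphenate(["Der Reichs-", "anzeiger erscheint", "täglich."]))
```

A heuristic like this will mis-join elliptical compounds such as "Nord-" followed by "und Süddeutschland", which is one reason a manual check of the resulting text remains useful.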
We tested INCEpTION, neat and MedTator, and chose INCEpTION as the most advanced of the three.
When old German plain text is annotated in INCEpTION or MedTator and the annotations are exported in IOB format, the tokenization is often incorrect. In these cases, neat can be used to correct the tokenization.
If plain text is imported into INCEpTION with one sentence per line rather than as unsegmented plain text, the annotations exported in IOB format are tokenized reasonably well.
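To make the link between tokenization and IOB output concrete, here is a minimal sketch that turns character-offset annotations on a single sentence into IOB tags. It is not the cas2iob converter and does not reproduce INCEpTION's tokenizer; the function name `spans_to_iob`, the whitespace tokenization, and the example sentence and label are assumptions for illustration.

```python
import re


def spans_to_iob(sentence, spans):
    """Return (token, tag) pairs for one sentence.

    spans is a list of (start, end, label) character offsets into sentence.
    Tokens are whitespace-separated chunks (a simplifying assumption).
    """
    rows = []
    for match in re.finditer(r"\S+", sentence):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if their character ranges overlap.
            if match.start() < end and match.end() > start:
                # The first token of a span gets B-, later ones get I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


sentence = "Der Kaiser reiste nach Bad Ems ."
for token, tag in spans_to_iob(sentence, [(23, 30, "LOC")]):
    print(f"{token}\t{tag}")
```

If the final period were attached to the preceding word ("Ems."), the whole token including the punctuation would be tagged `I-LOC`, which is the kind of tokenization error that then has to be corrected after export.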
We decided to develop the annotation guidelines iteratively, based on existing annotation guidelines for historical German texts and on an analysis of sample pages from the Reichsanzeiger.
Dataset | Text type | Century | Project | Annotation Guidelines | Annotation Tool | Tasks | Tagset | License |
---|---|---|---|---|---|---|---|---|
AjMC | Commentaries | XIX | Ajax MultiCommentary | Zenodo | INCEpTION | NER, NEL | pers, work, loc, object, date, scope | |
HIPE-2020 | Newspaper | mid XIX - mid XX | CLEF-HIPE-2020 | Zenodo | INCEpTION | NER, NEL | pers, org, prod, time, loc | |
Newseye | Newspaper | mid XIX - mid XX | Newseye | Zenodo | Transkribus | NER, NEL | PER, LOC, ORG, HumanProd | |
SoNAR | Newspaper | mid XIX - mid XX | SoNAR | Zenodo | neat | NER, NEL | PER, LOC, ORG | |