DeuParl Corpus

The goal is to create a diachronic corpus of German Reichstag (1867-1942) and Bundestag (1949-2021, 19 sessions) protocols.

The diachronic corpus should have the following slices:

  1. 1-KR1: Kaiserreich 1 (1867-1890)
  2. 2-KR2: Kaiserreich 2 (1890-1918)
  3. 3-WR: Weimarer Republik (1918-1933)
  4. 4-NS: Nationalsozialismus (1933-1942)
  5. 5-CDU1: CDU 1 (sessions 1, 2, 3, 4, 5)
  6. 6-SPD1: SPD 1 (sessions 6, 7, 8, 9)
  7. 7-CDU2: CDU 2 (sessions 10, 11, 12, 13)
  8. 8-SPD2: SPD 2 (sessions 14, 15)
  9. 9-CDU3: CDU 3 (sessions 16, 17, 18, 19)

This has been done before in this paper:

Diachronic Analysis of German Parliamentary Proceedings: Ideological Shifts through the Lens of Political Biases

However, their dataset creation process builds on source data whose prior processing we cannot trace.

The goal of this project is to create a cleaner, improved dataset in which every step of the creation process is transparent.

Corpus Creation Pipeline

You can find the repositories including the data on the Slurm cluster:

  • Original repository: /storage/nllg/compute-share/bodensohn/deuparl/DeuParl
  • New repository: /ukp-storage-1/vu/ocr_spelling_deuparl

The corpus creation pipeline has the following steps:

1. Data Collection and Preprocessing

The Reichstag protocols and the Bundestag protocols come from two separate sources. Because of this, data collection differs between the Reichstag and Bundestag protocols.

Reichstag Protocols:

The raw source data is located in /storage/nllg/compute-share/eger/melvin/reichstagsprotokolle (~500GB). It is made up of bsbXXXXXXXX folders, each of which contains a single xml folder, which in turn contains many bsbXXXXXXXX_XXXXX.xml documents. Each .xml file contains a single page of the parliamentary protocols obtained via optical character recognition (OCR) and preserves the line breaks of the original pages.

Furthermore, there is a "Konkordanz" (BSB_Reichstagsprotkolle_Konkordanz.csv) which states the year to which each of the bsbXXXXXXXX folders belongs.

The Python script 1_collect_reichstag.py goes through the raw source data, extracts the raw text from the .xml files, and uses the "Konkordanz" to create one bucket of documents per year in data/1_collected/Reichstag/. It can optionally apply some preprocessing corrections (see the sketch after the notes below).

  • This step is based on this project: https://github.com/SteffenEger/Corpus (no longer available)
  • The source data has been obtained from: /storage/nllg/compute-share/eger/melvin/reichstagsprotokolle/
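
The following is a minimal sketch of what this collection step might look like. The Konkordanz column names (bsb_id, year) and the way text is extracted from the OCR XML are assumptions for illustration, not the actual implementation of 1_collect_reichstag.py:

import csv
import os
import xml.etree.ElementTree as ET

SOURCE = "/storage/nllg/compute-share/eger/melvin/reichstagsprotokolle"
KONKORDANZ = os.path.join(SOURCE, "BSB_Reichstagsprotkolle_Konkordanz.csv")
OUT_DIR = "data/1_collected/Reichstag"

# Map each bsbXXXXXXXX folder to its year (column names are assumptions).
folder_to_year = {}
with open(KONKORDANZ, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        folder_to_year[row["bsb_id"]] = row["year"]

for folder, year in folder_to_year.items():
    xml_dir = os.path.join(SOURCE, folder, "xml")
    if not os.path.isdir(xml_dir):
        continue
    os.makedirs(os.path.join(OUT_DIR, year), exist_ok=True)
    out_path = os.path.join(OUT_DIR, year, folder + ".txt")
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(xml_dir)):
            if not name.endswith(".xml"):
                continue
            tree = ET.parse(os.path.join(xml_dir, name))
            # Concatenate all text nodes of one OCR page; the real script
            # depends on the actual schema of the BSB OCR XML output.
            page_text = " ".join(tree.getroot().itertext())
            out.write(page_text + "\n")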

Bundestag Protocols:

The raw data is located in data/source/Bundestag. For each session (1-19), there is a folder of .xml files.

2. Preprocessing

The second step is to preprocess the collected documents. This could, for example, mean removing noise by filtering out the start and end of each document that are not actually part of the protocol (check out tobiwalter_process_reichstag_data.py).
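
A minimal sketch of this kind of filtering is shown below; the start and end markers are illustrative assumptions, not the patterns actually used in the preprocessing scripts:

import re

# Illustrative markers only; the real patterns live in the preprocessing
# scripts (see tobiwalter_process_reichstag_data.py).
START_PATTERN = re.compile(r"Die Sitzung wird .{0,40} eröffnet")
END_PATTERN = re.compile(r"Schluß der Sitzung")

def trim_protocol(text):
    """Keep only the part between the opening and the closing of the session."""
    start = START_PATTERN.search(text)
    end = END_PATTERN.search(text)
    begin = start.start() if start else 0
    stop = end.end() if end else len(text)
    return text[begin:stop]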

Reichstag Protocols:

The Python script 2_preprocess_reichstag.py goes through the collected documents in data/1_collected/Reichstag/, preprocesses them, and stores the result in data/2_preprocessed/Reichstag/.

Bundestag Protocols:

The Python script 2_preprocess_bundestag.py goes through the collected documents in data/1_collected/Bundestag/, preprocesses them, and stores the result in data/2_preprocessed/Bundestag/.

3. OCR post-correction & Spelling normalization

Since the Reichstag protocols have been digitized using optical character recognition, they contain character errors (e.g., "l" instead of "i"). The goal of OCR post-correction is to fix these errors. The idea behind Spelling normalization is to map multiple (historical) spellings of a word (e.g., "Theil") to a single canonical, current form (e.g., "Teil").

Some ground-truth training data was generated by fixing character errors by hand (see data/ocr_post_correction/raw_training_data); these files are combined into a single file of training instances (data/ocr_post_correction/data.csv). We also extend the ground-truth training data to Spelling normalization (data/ocr_post_correction/data_norma.csv).
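
As a minimal sketch, the combined training file could be read as follows; the column names ("source" for the noisy OCR text, "target" for the corrected text) are assumptions about data.csv, not its documented format:

import pandas as pd

# Column names ("source", "target") are assumptions about data.csv.
df = pd.read_csv("data/ocr_post_correction/data.csv")
pairs = list(zip(df["source"], df["target"]))
print(f"{len(pairs)} training instances, e.g.: {pairs[0]}")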

Reichstag Protocols:

The Python script 3_ocr_post_correct_spelling_normalization_reichstag.py uses the documents in data/2_preprocessed/Reichstag/, applies OCR post-correction & Spelling normalization to them, and stores them in data/3_ocr_post_corrected_spelling_normalization/Reichstag/.

Bundestag Protocols:

The Python script 3_spelling_normalization_bundestag.py uses the documents in data/2_preprocessed/Bundestag/, applies Spelling normalization to them, and stores them in data/3_ocr_post_corrected_spelling_normalization/Bundestag/.
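
As an illustration of how one of the released fine-tuned models could be applied to a single noisy line, here is a sketch using the Hugging Face transformers M2M100 classes; the exact model path (assembled from the model name in the "Model" section below), the language settings, and the example sentence are assumptions:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Path assumed from the model name mentioned in the "Model" section below.
model_path = "data/models/m2m100_418M-for-ocr-post-correction-norma-model-50"
tokenizer = M2M100Tokenizer.from_pretrained(model_path)
model = M2M100ForConditionalGeneration.from_pretrained(model_path)

tokenizer.src_lang = "de"
noisy = "Der Relchstag ist eröffnet und die Verhandlung nimmt ihren Anfang."
inputs = tokenizer(noisy, return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])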

Issues

Running the whole corpus through this step to generate the data would take a long time (probably more than a month on an NVIDIA A100). We recommend using the DeuParl API below to generate individual texts or a batch.

4. Slicing

The final step partitions the documents into slices.

Reichstag Protocols:

The Python script 4_slice_reichstag.py uses the documents in data/3_ocr_post_corrected_spelling_normalization/Reichstag/, creates the first four slices, and stores them in data/4_sliced/.

Bundestag Protocols:

The Python script 4_slice_bundestag.py uses the documents in data/3_ocr_post_corrected_spelling_normalization/Bundestag/, creates the last five slices, and stores them in data/4_sliced/.
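
For reference, here is a minimal sketch of the year-to-slice and session-to-slice mappings implied by the slice definitions above; how the slicing scripts resolve the boundary years (1890, 1918, 1933) is an assumption:

def reichstag_slice(year):
    """Map a protocol year to one of the first four slices.
    Boundary years are assigned to the earlier slice here; 4_slice_reichstag.py may differ."""
    if 1867 <= year <= 1890:
        return "1-KR1"
    if year <= 1918:
        return "2-KR2"
    if year <= 1933:
        return "3-WR"
    if year <= 1942:
        return "4-NS"
    raise ValueError(f"Year {year} is outside the Reichstag period")

def bundestag_slice(session):
    """Map a Bundestag session to one of the last five slices."""
    if 1 <= session <= 5:
        return "5-CDU1"
    if session <= 9:
        return "6-SPD1"
    if session <= 13:
        return "7-CDU2"
    if session <= 15:
        return "8-SPD2"
    if session <= 19:
        return "9-CDU3"
    raise ValueError(f"Session {session} is outside the Bundestag range")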

Other Resources

The folder code_from_other_projects contains scripts that have been obtained from other projects and may include some helpful snippets (e.g., regular expressions).

tobiwalter_process_reichstag_data.py summarizes the preprocessing step as done by Tobias Walter in https://arxiv.org/abs/2108.06295.

Since it was written for different source data, it cannot be easily integrated into this project.

Model

We used MBART and M2M100_418M for this project and release our fine-tuned models in data/models. Only m2m100_418M-for-ocr-post-correction-norma-model-50 is fine-tuned on the combined OCR post-correction + Spelling normalization dataset; the other models are fine-tuned for OCR post-correction only.

Evaluation

OCR post-correction

evaluation_ocr_post_correction.py evaluates OCR post-correction models on non-concatenated data (see data/ocr_post_correction/raw_training_data). Usage:

python evaluation_ocr_post_correction.py --data {1873, 1900, 1914, 1916} --model {m2m100_418M, mbart}

OCR post-correction & Spelling normalization

eval_spell.py evaluates the m2m100_418M-for-ocr-post-correction-norma-model-50 model on 10% of data_norma.csv.

DeuParl API

We provide a simple API to load our DeuParl data quickly.

Example:

from deuparl import Reichstag

year = 1890
session = 1

data = Reichstag(year, session)

preprocessed_data = data.get_clean()

post_ocr_norma_data = data.get_post_ocr_norma()

We also provide a CLI tool to generate OCR post-corrected and spelling-normalized text for individual documents. The CLI tool can take multiple years as input but only one session. Example:

python deuparl.py -years 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 -session 2 

Slurm Cluster

  • slurm.ukp.informatik.tu-darmstadt.de
  • Original data: /storage/nllg/compute-share/bodensohn/deuparl/Deuparl

Make sure to install torch in the virtual environment so that CUDA can be used (https://pytorch.org/).

Check out the UKP Wiki for more information.
