The goal is to create a diachronic corpus of German Reichstag (1867-1942) and Bundestag (1949-2021, 19 sessions) protocols.
The diachronic corpus should have the following slices:
- 1-KR1: Kaiserreich 1 (1867-1890)
- 2-KR2: Kaiserreich 2 (1890-1918)
- 3-WR: Weimarer Republik (1918-1933)
- 4-NS: Nationalsozialismus (1933-1942)
- 5-CDU1: CDU 1 (sessions 1, 2, 3, 4, 5)
- 6-SPD1: SPD 1 (sessions 6, 7, 8, 9)
- 7-CDU2: CDU 2 (sessions 10, 11, 12, 13)
- 8-SPD2: SPD 2 (sessions 14, 15)
- 9-CDU3: CDU 3 (sessions 16, 17, 18, 19)
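For reference, the slice definitions can be written down as a small Python mapping. This is only a sketch of the configuration above; in particular, how the boundary years 1890, 1918, and 1933 are assigned to a slice is an assumption here, not something the list above pins down.

```python
# Sketch of the slice definitions; boundary-year assignment is an assumption.
REICHSTAG_SLICES = {
    "1-KR1": range(1867, 1890),  # Kaiserreich 1
    "2-KR2": range(1890, 1918),  # Kaiserreich 2
    "3-WR":  range(1918, 1933),  # Weimarer Republik
    "4-NS":  range(1933, 1943),  # Nationalsozialismus (up to and including 1942)
}
BUNDESTAG_SLICES = {
    "5-CDU1": [1, 2, 3, 4, 5],
    "6-SPD1": [6, 7, 8, 9],
    "7-CDU2": [10, 11, 12, 13],
    "8-SPD2": [14, 15],
    "9-CDU3": [16, 17, 18, 19],
}
```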
This has been done before in this paper:
Diachronic Analysis of German Parliamentary Proceedings: Ideological Shifts through the Lens of Political Biases
- https://arxiv.org/abs/2108.06295
- https://github.com/tobiwalter/Investigating-Antisemitic-Bias-in-German-Parliamentary-Proceedings
- https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2889?show=full
However, their dataset creation process builds on source data whose prior processing cannot be traced.
With this project, the goal is to create a cleaner and improved dataset where every step of the creation process is transparent.
You can find the original repository including the data on slurm: /storage/nllg/compute-share/bodensohn/deuparl/DeuParl
You can find the new repository including the data on slurm: /ukp-storage-1/vu/ocr_spelling_deuparl
The corpus creation pipeline has the following steps:
The first step is to collect the raw protocols. The Reichstag protocols and the Bundestag protocols come from two separate sources:
- Reichstag protocols: https://www.reichstagsprotokolle.de
- Bundestag protocols: https://www.bundestag.de/services/opendata
Because of this, the data collection is different for the Reichstag and Bundestag protocols.
Reichstag Protocols:
The raw source data is located in /storage/nllg/compute-share/eger/melvin/reichstagsprotokolle (~500GB). It is made up of bsbXXXXXXXX folders, each of which contains a single xml folder, which in turn contains many bsbXXXXXXXX_XXXXX.xml documents. Each of these .xml files comprises a single page of parliament protocols obtained via optical character recognition; they include the line breaks from the original pages.
Furthermore, there is a "Konkordanz" (BSB_Reichstagsprotkolle_Konkordanz.csv) which states the year each of the bsbXXXXXXXX folders belongs to.
The Python script 1_collect_reichstag.py goes through the raw source data, extracts the raw text from the .xml files, and uses the "Konkordanz" to create one bucket of documents for each year in data/1_collected/Reichstag/. Furthermore, it can optionally apply some preprocessing corrections.
- This step is based on this project: https://github.com/SteffenEger/Corpus (no longer available)
- The source data has been obtained from: /storage/nllg/compute-share/eger/melvin/reichstagsprotokolle/
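A minimal sketch of this collection step, assuming the page-level XML files expose their text through ordinary text nodes and that the Konkordanz has columns named bsb_id and year (the column names, the CSV location, and the output layout are assumptions, not the actual format):

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET

SOURCE = "/storage/nllg/compute-share/eger/melvin/reichstagsprotokolle"
OUT = "data/1_collected/Reichstag"

# Read the Konkordanz; the column names "bsb_id" and "year" are assumptions.
with open("BSB_Reichstagsprotkolle_Konkordanz.csv", newline="", encoding="utf-8") as f:
    bsb_to_year = {row["bsb_id"]: row["year"] for row in csv.DictReader(f)}

for bsb_dir in sorted(glob.glob(os.path.join(SOURCE, "bsb*"))):
    bsb_id = os.path.basename(bsb_dir)
    year = bsb_to_year.get(bsb_id)
    if year is None:
        continue  # skip folders not listed in the Konkordanz
    os.makedirs(os.path.join(OUT, year), exist_ok=True)
    pages = []
    for xml_file in sorted(glob.glob(os.path.join(bsb_dir, "xml", "*.xml"))):
        # Concatenate all text nodes of one OCR page; line breaks are kept
        # if they are encoded as separate elements.
        root = ET.parse(xml_file).getroot()
        pages.append("\n".join(t.strip() for t in root.itertext() if t.strip()))
    with open(os.path.join(OUT, year, f"{bsb_id}.txt"), "w", encoding="utf-8") as out:
        out.write("\n".join(pages))
```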
Bundestag Protocols:
The raw data is located in data/source/Bundestag. For each session (1-19), there is a folder of .xml files.
- This step is based on this project: https://github.com/SteffenEger/bundestagsprotokolle (no longer available)
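A minimal sketch of the corresponding Bundestag collection step, assuming the session folders in data/source/Bundestag are named 1-19 and that simply extracting all text content is sufficient, since the concrete XML schema of the open-data protocols differs between legislative periods:

```python
import glob
import os
import xml.etree.ElementTree as ET

SOURCE = "data/source/Bundestag"
OUT = "data/1_collected/Bundestag"

for session in range(1, 20):
    in_dir = os.path.join(SOURCE, str(session))
    out_dir = os.path.join(OUT, str(session))
    os.makedirs(out_dir, exist_ok=True)
    for xml_file in sorted(glob.glob(os.path.join(in_dir, "*.xml"))):
        # Extract all text content regardless of the concrete schema,
        # which differs between older and newer legislative periods.
        root = ET.parse(xml_file).getroot()
        text = "\n".join(t.strip() for t in root.itertext() if t.strip())
        name = os.path.splitext(os.path.basename(xml_file))[0] + ".txt"
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as out:
            out.write(text)
```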
The second step is to preprocess the collected documents. This could, for example, mean removing noise by filtering out the start and end of the documents that are not actually part of the protocol (see tobiwalter_process_reichstag_data.py in code_from_other_projects/).
Reichstag Protocols:
The Python script 2_preprocess_reichstag.py goes through the collected documents in data/1_collected/Reichstag/, preprocesses them, and stores the result in data/2_preprocessed/Reichstag/.
Bundestag Protocols:
The Python script 2_preprocess_bundestag.py goes through the collected documents in data/1_collected/Bundestag/, preprocesses them, and stores the result in data/2_preprocessed/Bundestag/.
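The concrete cleaning rules live in the two scripts above (and in Tobias Walter's scripts, see below). As an illustration only, trimming the non-protocol front and back matter could look like the following sketch, where the start and end markers are assumptions:

```python
import re

# Illustrative markers; the real scripts use more elaborate regular expressions.
START_RE = re.compile(r"Die Sitzung wird .{0,40} eröffnet|Beginn:\s*\d+ Uhr")
END_RE = re.compile(r"Schlu(ß|ss) der Sitzung")

def trim_protocol(text: str) -> str:
    """Keep only the part between the first start marker and the last end marker."""
    start = START_RE.search(text)
    end = None
    for end in END_RE.finditer(text):
        pass  # keep the last match
    begin_idx = start.start() if start else 0
    end_idx = end.end() if end else len(text)
    return text[begin_idx:end_idx]
```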
The third step is OCR post-correction and Spelling normalization. Since the Reichstag protocols have been digitized using optical character recognition, they contain character errors (e.g., "l" instead of "i"). The goal of OCR post-correction is to fix these errors. The idea behind Spelling normalization is to map multiple (historical) spellings of a word (e.g., "Theil") to a single canonical, current form (e.g., "Teil").
Some ground truth training data was generated by fixing character errors by hand (see data/ocr_post_correction/raw_training_data); these corrections are combined into a single file of training instances (data/ocr_post_correction/data.csv). We also extend the ground truth training data to Spelling normalization (data/ocr_post_correction/data_norma.csv).
Reichstag Protocols:
The Python script 3_ocr_post_correct_spelling_normalization_reichstag.py uses the documents in data/2_preprocessed/Reichstag/, applies OCR post-correction & Spelling normalization to them, and stores them in data/3_ocr_post_corrected_spelling_normalization/Reichstag/.
Bundestag Protocols:
The Python script 3_spelling_normalization_bundestag.py uses the documents in data/2_preprocessed/Bundestag/, applies Spelling normalization to them, and stores them in data/3_ocr_post_corrected_spelling_normalization/Bundestag/.
Issues
Running the pipeline over the whole corpus to generate the data would take a long time (probably more than a month on an NVIDIA A100). We recommend using the DeuParl API below to generate individual texts or a batch.
The final step partitions the documents into slices.
Reichstag Protocols:
The Python script 4_slice_reichstag.py uses the documents in data/3_ocr_post_corrected_spelling_normalization/Reichstag/, creates the first four slices, and stores them in data/4_sliced/.
Bundestag Protocols:
The Python script 4_slice_bundestag.py uses the documents in data/3_ocr_post_corrected_spelling_normalization/Bundestag/, creates the last five slices, and stores them in data/4_sliced/.
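A minimal sketch of the Reichstag partitioning, assuming the documents are still grouped into per-year folders and that boundary years are assigned as in the mapping sketched earlier:

```python
import glob
import os
import shutil

def year_to_slice(year: int) -> str:
    # Boundary-year assignment is an assumption here.
    if year < 1890:
        return "1-KR1"
    if year < 1918:
        return "2-KR2"
    if year < 1933:
        return "3-WR"
    return "4-NS"

SRC = "data/3_ocr_post_corrected_spelling_normalization/Reichstag"
DST = "data/4_sliced"

for year_dir in sorted(glob.glob(os.path.join(SRC, "*"))):
    slice_name = year_to_slice(int(os.path.basename(year_dir)))
    out_dir = os.path.join(DST, slice_name)
    os.makedirs(out_dir, exist_ok=True)
    for doc in glob.glob(os.path.join(year_dir, "*")):
        shutil.copy(doc, out_dir)
```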
The folder code_from_other_projects contains scripts that have been obtained from other projects and may include some helpful snippets (e.g., regular expressions). tobiwalter_process_reichstag_data.py summarizes the preprocessing step as done by Tobias Walter in https://arxiv.org/abs/2108.06295. Since it was written for different source data, it cannot be easily integrated into this project.
We used MBART and M2M100_418M for this project. We release our fine-tuned models in data/models. Only m2m100_418M-for-ocr-post-correction-norma-model-50 has been fine-tuned on the combined OCR post-correction + Spelling normalization dataset; the other models are fine-tuned for OCR post-correction only.
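The released checkpoints should be loadable with Hugging Face transformers. A minimal inference sketch, assuming the checkpoint folder under data/models is named after the model and that correction is run sentence by sentence with German as both source and target language (the actual generation settings of the pipeline scripts are not reproduced here):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Assumed checkpoint path: data/models/<model name>
MODEL_DIR = "data/models/m2m100_418M-for-ocr-post-correction-norma-model-50"

tokenizer = M2M100Tokenizer.from_pretrained(MODEL_DIR)
model = M2M100ForConditionalGeneration.from_pretrained(MODEL_DIR)
tokenizer.src_lang = "de"  # German-to-German "translation": noisy OCR text in, corrected text out

def correct(line: str) -> str:
    inputs = tokenizer(line, return_tensors="pt")
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id("de"),
        max_length=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(correct("Der Relchstag hat in seiner heutigen Sltzung ..."))
```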
evaluation_ocr_post_correction.py evaluates OCR post-correction models on non-concatenated data (see data/ocr_post_correction/raw_training_data). Usage:
python evaluation_ocr_post_correction.py --data {1873, 1900, 1914, 1916} --model {m2m100_418M, mbart}
eval_spell.py evaluates the m2m100_418M-for-ocr-post-correction-norma-model-50 model on 10% of data_norma.csv.
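The exact metric reported by the evaluation scripts is not documented here; character error rate (CER) is a common choice for OCR post-correction, and a minimal sketch of it looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[len(b)]

def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("Thell", "Teil"))  # e.g. noisy OCR / historical spelling vs. normalized reference
```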
We provide a simple API to load our DeuParl data quickly.
Example:
from deuparl import Reichstag

year = 1890
session = 1
data = Reichstag(year, session)
preprocessed_data = data.get_clean()
post_ocr_norma_data = data.get_post_ocr_norma()
We provide a CLI tool for generating OCR post-corrected and Spelling normalized text for individual documents. The CLI tool accepts multiple years as input but only one session. Example:
python deuparl.py -years 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 -session 2
Slurm cluster: slurm.ukp.informatik.tu-darmstadt.de
- Original data: /storage/nllg/compute-share/bodensohn/deuparl/Deuparl
Make sure to install torch in the virtual environment to use CUDA (https://pytorch.org/).
Check out the UKP Wiki for more information.