This repository contains the Python scripts supporting the experiments and analysis presented in our paper:
Debating Regional Challenges: Insights into the Carniolan Provincial Assembly in the Austro-Hungarian Empire
Presented at the DH2025 conference in Lisbon, July 2025
DOI: [not available yet]
This repository provides three Python scripts used to prepare the data, extract the topics, and run the topic analysis over the years for the Kranjska 1.0 corpus, as described in the paper. The source corpus has to be downloaded from an external repository (see below).
src
├── extract_lemmas.py # Prepares input data for topic postprocessing
├── prepare_data_speeches.py # Prepares input data for topic analysis
└── topics_time.py # Main script: runs the topic analysis, prints the results, and generates visualisations
The Kranjska 1.0 corpus is publicly available on CLARIN.SI under the CC BY 4.0 license. URL: http://hdl.handle.net/11356/1824
Two zip files have to be downloaded:
- Kranjska corpus in TEI format: Kranjska-xml-text.zip (31.12 MB), needed for topic analysis
- Kranjska corpus in TEI with linguistic annotation: Kranjska-xml.zip (157.91 MB), needed for postprocessing, where the topics' keywords are limited to lemmas to avoid repetition of word forms
Unzip both files into the ./corpus/ directory.
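The unzip step can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `unzip_archives` is hypothetical, and it assumes the two downloaded archives sit in the current working directory.

```python
import zipfile
from pathlib import Path


def unzip_archives(archives, target_dir="./corpus/"):
    """Extract every archive into target_dir and return the extracted member names.

    Hypothetical helper for illustration; the repository scripts do not define it.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            extracted.extend(zf.namelist())
    return extracted


if __name__ == "__main__":
    # Archive names as listed on the CLARIN.SI page; their location is an assumption.
    wanted = ["Kranjska-xml-text.zip", "Kranjska-xml.zip"]
    present = [name for name in wanted if Path(name).exists()]
    missing = [name for name in wanted if name not in present]
    if missing:
        print("Not found, download first:", ", ".join(missing))
    if present:
        print(f"Extracted {len(unzip_archives(present))} entries into ./corpus/")
```

If the scripts expect a different folder layout inside ./corpus/, adjust `target_dir` or the paths in the scripts accordingly.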
- Download the external corpus files and place them in the appropriate folders (by default, unzipped into ./corpus/).
- Change the names of input and output files and folders in all Python scripts (if needed).
- Prepare the data for topic analysis (generate ./data/bert_docs_time_stamps.json file):
python prepare_data_speeches.py
- Prepare the data for postprocessing (generate ./data/word_lemmas.json file):
python extract_lemmas.py
- Install the BERTopic library and all dependencies, if needed.
- Run the main analysis and plot results:
python topics_time.py
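To illustrate the idea behind the postprocessing step, where topic keywords are limited to lemmas to avoid repeated word forms, here is a minimal sketch. It is not the code in extract_lemmas.py. It assumes that the annotated TEI files mark tokens as `<w lemma="...">` elements in the TEI namespace; the function names and the sample snippet are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def word_lemmas_from_tei(xml_text):
    """Map each lowercased word form to its lemma, read from <w lemma="..."> elements.

    Assumes tokens are encoded as TEI <w> elements with a lemma attribute.
    """
    root = ET.fromstring(xml_text)
    mapping = {}
    for w in root.iter(f"{TEI_NS}w"):
        form = (w.text or "").strip().lower()
        lemma = w.get("lemma")
        if form and lemma:
            # Keep the first lemma seen for a given word form.
            mapping.setdefault(form, lemma.lower())
    return mapping


def dedupe_keywords(keywords, word_lemmas):
    """Replace keywords with their lemmas and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    for keyword in keywords:
        # Keywords without a known lemma are kept as they are (lowercased).
        lemma = word_lemmas.get(keyword.lower(), keyword.lower())
        if lemma not in seen:
            seen.add(lemma)
            result.append(lemma)
    return result


# Hypothetical TEI fragment, made up for this example.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><s>
<w lemma="šola">šole</w> <w lemma="šola">šolo</w> <w lemma="cesta">ceste</w>
</s></body></text></TEI>"""

lemmas = word_lemmas_from_tei(SAMPLE)
print(dedupe_keywords(["šole", "šolo", "ceste", "davek"], lemmas))
# prints ['šola', 'cesta', 'davek']
```

Here the two inflected forms of "šola" collapse into one keyword, while "davek", which has no entry in the mapping, passes through unchanged.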
For questions or feedback, feel free to reach out:
Alenka Kavčič
[[email protected]]
This code is released under the GPL 3.0 or later license. See LICENSE for details.