GitHub - UL-FRI-LGM/Kranjska-Topics: Python scripts for data preparation and extraction of topics from the Kranjska 1.0 corpus.

Kranjska-Topics

This repository contains the Python scripts supporting the experiments and analysis presented in our paper:

Debating Regional Challenges: Insights into the Carniolan Provincial Assembly in the Austro-Hungarian Empire
Presented at the DH2025 conference in Lisbon, July 2025
DOI: [not available yet]

Overview

This repository provides three Python scripts used to prepare the data, extract the topics, and analyze how the topics change over the years in the Kranjska 1.0 corpus, as described in the paper. The source corpus has to be downloaded from an external repository (see below).

Carniolan Provincial Assembly corpus Kranjska 1.0

The Kranjska 1.0 corpus is publicly available on CLARIN.SI under the CC BY 4.0 license. URL: http://hdl.handle.net/11356/1824

Two zip files have to be downloaded:

Kranjska corpus in TEI format: Kranjska-xml-text.zip (31.12 MB), needed for topic analysis
Kranjska corpus in TEI with linguistic annotation: Kranjska-xml.zip (157.91 MB), needed for postprocessing, where the topics' keywords are limited to lemmas to avoid repetition of word forms
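
The lemma-based postprocessing mentioned above can be illustrated with a minimal sketch. It assumes a flat dictionary that maps each word form to its lemma; the actual layout of the lemma data extracted from the annotated corpus may differ. The function name `dedupe_by_lemma` and the sample words are hypothetical and are not taken from the repository scripts.

```python
def dedupe_by_lemma(keywords, word_lemmas):
    """Map each keyword to its lemma and keep only the first occurrence of each lemma.

    `word_lemmas` is assumed to be a flat {word_form: lemma} dict (an assumption,
    not the confirmed file format); words missing from it are kept unchanged.
    """
    seen = set()
    result = []
    for word in keywords:
        lemma = word_lemmas.get(word.lower(), word.lower())
        if lemma not in seen:
            seen.add(lemma)
            result.append(lemma)
    return result


# Hypothetical example: three inflected forms of "cesta" (road) collapse into one keyword.
word_lemmas = {"cesta": "cesta", "ceste": "cesta", "cesto": "cesta", "šole": "šola"}
keywords = ["ceste", "cesto", "šole", "cesta", "deželni"]
print(dedupe_by_lemma(keywords, word_lemmas))  # ['cesta', 'šola', 'deželni']
```

Keeping the first occurrence preserves the original keyword ranking, so the highest-ranked word form determines where its lemma appears in the list.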

Unzip both files into the ./corpus/ directory.
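
If you prefer to unpack the archives from Python rather than with a desktop tool, the following sketch uses only the standard library. It assumes both zip files have already been downloaded into the current working directory; the helper name `extract_archives` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archives(zip_paths, dest_dir="corpus"):
    """Extract each zip archive into dest_dir and return the extracted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dest)
            extracted.extend(archive.namelist())
    return extracted


if __name__ == "__main__":
    # Archive names as listed above; assumed to be in the current directory.
    archives = ["Kranjska-xml-text.zip", "Kranjska-xml.zip"]
    present = [name for name in archives if Path(name).exists()]
    if present:
        print(f"Extracted {len(extract_archives(present))} entries into ./corpus/")
```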

How to Run

Download external files and place them in the appropriate folders.
Change the names of input and output files and folders in all Python scripts (if needed).
Prepare the data for topic analysis (generate ./data/bert_docs_time_stamps.json file):
```
python prepare_data_speeches.py
```
Prepare the data for postprocessing (generate ./data/word_lemmas.json file):
```
python extract_lemmas.py
```
Install the BERTopic library and its dependencies, if needed (for example, with pip install bertopic).
Run the main analysis and plot results:
```
python topics_time.py
```
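
The steps above can also be chained in a small driver script. The sketch below is not part of the repository: it assumes the three scripts are in the directory given by `script_dir`, that they write their outputs relative to the current working directory, and that BERTopic is already installed. The `run_pipeline` function and its `runner` parameter (a hook for substituting `subprocess.run`) are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names and output files as listed in the steps above.
# The outputs of the final step are not specified here, so none is checked.
STEPS = [
    ("prepare_data_speeches.py", "data/bert_docs_time_stamps.json"),
    ("extract_lemmas.py", "data/word_lemmas.json"),
    ("topics_time.py", None),
]


def run_pipeline(steps=STEPS, script_dir=".", runner=subprocess.run):
    """Run each script in order; stop if a script fails or its expected output is missing."""
    for script, expected in steps:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit code.
        runner([sys.executable, str(Path(script_dir) / script)], check=True)
        if expected is not None and not Path(expected).exists():
            raise FileNotFoundError(f"{script} did not produce {expected}")
```

Calling `run_pipeline()` from the directory that contains the scripts runs all three in sequence and stops at the first failure, which makes it easier to see which preparation step needs attention.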

Contact

For questions or feedback, feel free to reach out:

Alenka Kavčič
[[email protected]]

License

This code is released under the GPL 3.0 or later license. See LICENSE for details.

