This repository contains the code required for building the ParlaMint-RO
corpus.
Each script in the root of this repository represents a step in the processing pipeline. These scripts can be classified as follows:
build-speakers-list.py
- iterates through session transcripts in JSON format and builds a list of unique speaker names, which is then saved to a CSV file.
classify-speakers.py
- iterates through session transcripts in JSON format and classifies speakers into MPs and invited speakers; the lists are saved in CSV format.
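As a rough illustration of the speaker-list step, the sketch below collects unique speaker names from a directory of JSON transcripts and writes them to a CSV file. It is not the code from build-speakers-list.py: the function names, the `speeches` and `speaker` keys, and the one-column CSV layout are all assumptions made for this example, and the real transcripts may be structured differently.

```python
import csv
import json
from pathlib import Path


def collect_speaker_names(transcript_dir):
    """Return the sorted, unique speaker names found in *.json transcripts."""
    names = set()
    for path in sorted(Path(transcript_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            session = json.load(f)
        # Assumed (hypothetical) layout: {"speeches": [{"speaker": "..."}, ...]}
        for speech in session.get("speeches", []):
            name = (speech.get("speaker") or "").strip()
            if name:  # skip missing or empty speaker fields
                names.add(name)
    return sorted(names)


def save_speaker_names(names, csv_path):
    """Write one speaker name per row under a single 'speaker' header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["speaker"])
        writer.writerows([name] for name in names)
```

A call such as `save_speaker_names(collect_speaker_names("transcripts"), "speakers.csv")` would then produce the list, where both paths are placeholders.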
The script to build the corpus is build-corpus.py.