Merge pull request #810 from clarin-eric/main
data-main
matyaskopp authored Oct 11, 2023
2 parents d38275c + ff8e532 commit 1695edb
Showing 82 changed files with 7,749 additions and 628 deletions.
23 changes: 23 additions & 0 deletions Corpora/Docs/Makefile
@@ -0,0 +1,23 @@
## Copy and rename README and registry files so they can be included in a distribution

cp_all: cp_readmes cp_regis

README_CORPORA = AT BA BE BG CZ DK EE ES ES-CT ES-GA ES-PV FI FR GB GR HR HU IS IT LV NL NO PL PT RS SE SI TR UA
REGIST_VERSION = 40
REGIST_CORPORA = at ba be bg cz dk ee es es_ct es_ga es_pv fi fr gb gr hr hu is it lv nl no pl pt rs se si tr ua

# Main readmes from Data/, store here and rename
cp_readmes:
	rm -f README.md/*.md
	for CORPUS in ${README_CORPORA}; do \
	    cp ../../Samples/ParlaMint-$${CORPUS}/README.md \
	    README.md/README-$${CORPUS}.md; \
	done;

# Registry files
cp_regis:
	rm -f registry/*
	for CORPUS in ${REGIST_CORPORA}; do \
	    cp /project/clarinsi-cqp/registry/parlamint${REGIST_VERSION}_$${CORPUS} \
	    registry/parlamint${REGIST_VERSION}_$${CORPUS}; \
	done;
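
The two corpus lists in this Makefile appear to name the same corpora in two spellings: upper case with hyphens for the READMEs, lower case with underscores for the registry files. The following Python sketch illustrates that relationship. It is not part of the commit, and the helper name `registry_name` is hypothetical; it assumes the registry code is always the lowercased README code with `-` replaced by `_`.

```python
# Illustrative sketch only; the Makefile itself keeps two hand-written lists.
README_CORPORA = (
    "AT BA BE BG CZ DK EE ES ES-CT ES-GA ES-PV FI FR GB GR HR HU IS IT "
    "LV NL NO PL PT RS SE SI TR UA"
).split()
REGIST_CORPORA = (
    "at ba be bg cz dk ee es es_ct es_ga es_pv fi fr gb gr hr hu is it "
    "lv nl no pl pt rs se si tr ua"
).split()
REGIST_VERSION = 40


def registry_name(corpus: str, version: int = REGIST_VERSION) -> str:
    """Return the registry file name for a README corpus code such as ES-CT."""
    code = corpus.lower().replace("-", "_")  # assumed naming rule
    return f"parlamint{version}_{code}"


# Check that the assumed rule reproduces the hand-written registry list.
derived = [registry_name(c) for c in README_CORPORA]
expected = [f"parlamint{REGIST_VERSION}_{c}" for c in REGIST_CORPORA]
```

If the rule holds, `derived` and `expected` are identical, so a single list would be enough to drive both targets.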
12 changes: 12 additions & 0 deletions Corpora/Docs/README-en.TEI.ana.txt
@@ -0,0 +1,12 @@
Comparable parliamentary corpus
ParlaMint-XX.ana ZZ
TEI linguistically annotated version

Citation, documentation, download, and licence available from
YY

This directory contains the ParlaMint/TEI linguistically annotated version of
the machine translated ParlaMint-XX.ana corpus. The root file ParlaMint-XX.ana.xml
contains the corpus teiHeader and XIncludes of the component files, which have the form
ParlaMint-XX_<suffix>.ana.xml. Also included are the XML schemas of the
corpus.
11 changes: 11 additions & 0 deletions Corpora/Docs/README-en.TEI.txt
@@ -0,0 +1,11 @@
Comparable parliamentary corpus
ParlaMint-XX ZZ
TEI encoded version

Citation, documentation, download, and licence available from
YY

This directory contains the ParlaMint/TEI encoded version of the machine
translated ParlaMint-XX corpus. The root file ParlaMint-XX.xml contains
the corpus teiHeader and XIncludes of the component files, which have the
form ParlaMint-XX_<suffix>.xml.
14 changes: 14 additions & 0 deletions Corpora/Docs/README-en.conll.txt
@@ -0,0 +1,14 @@
Comparable parliamentary corpus
ParlaMint-XX.ana ZZ
Derived CoNLL-U encoded corpus with TSV metadata

Citation, documentation, download, and licence available from
YY

This directory contains the CoNLL-U encoded machine translated ParlaMint-XX.ana corpus,
which was automatically converted from its linguistically encoded TEI.ana version. The
corpus also contains NER annotation in the IOB format. Additionally, each
CoNLL-U file has an associated TSV file giving the metadata of its speeches.

Note that the CoNLL-U files do not contain all the information from the source
corpus; in particular, the transcriber comments are not included.
55 changes: 55 additions & 0 deletions Corpora/Docs/README-en.schema.txt
@@ -0,0 +1,55 @@
Comparable parliamentary corpora
ParlaMint ZZ
Parla-CLARIN and ParlaMint schemas with conversion scripts

Citation, documentation, download, and licence available from
YY

This directory contains:

1. The Parla-CLARIN schema and documentation, archived from
https://github.com/clarin-eric/parla-clarin. This schema was used
as the overall frame in which the ParlaMint corpora were encoded,
and is, in this context, useful mostly for its documentation that
can be found in the Parla-CLARIN/docs directory.

2. The ParlaMint schemas, which were made specifically for the ParlaMint
corpora and attempt to constrain the encoding as tightly as possible to
what the ParlaMint corpora actually use. More about these below.

3. Some XSLT scripts in directory bin/ that can be used to convert
ParlaMint TEI encoded files into other formats, such as plain text,
CoNLL-U and vertical files.

ParlaMint schemas

The ParlaMint schemas are available in RelaxNG XML format (.rng),
RelaxNG compact format (.rnc) and in the W3C Schema language
(.xsd). As ParlaMint corpora are too large to be validated as one
document, and, furthermore, exist in two versions (the "plain text"
one, and the linguistically annotated one) there are four schemas for
validation:

- ParlaMint-TEI, used to validate component (TEI rooted) files of a
ParlaMint corpus. This schema also contains most of the definitions
that are imported by the other schemas.

- ParlaMint-teiCorpus, used to validate the top-level (teiCorpus
rooted) file of a ParlaMint corpus. The file should contain
XIncludes of the corpus components.

- ParlaMint-TEI.ana, used to validate component (TEI rooted) files of
the linguistically annotated version of a ParlaMint corpus.

- ParlaMint-teiCorpus.ana, used to validate the top-level (teiCorpus
rooted) file of the linguistically annotated version of a ParlaMint
corpus.

For validating the ParlaMint corpus of country XX using standard
ParlaMint names for directories and files, a validation run under Unix
using jing installed at /usr/share/java/ would be:

$ java -jar /usr/share/java/jing.jar ParlaMint-teiCorpus.rng ParlaMint-XX/ParlaMint-XX.xml
$ java -jar /usr/share/java/jing.jar ParlaMint-TEI.rng ParlaMint-XX/ParlaMint-XX_*.xml
$ java -jar /usr/share/java/jing.jar ParlaMint-teiCorpus.ana.rng ParlaMint-XX.ana/ParlaMint-XX.ana.xml
$ java -jar /usr/share/java/jing.jar ParlaMint-TEI.ana.rng ParlaMint-XX.ana/ParlaMint-XX_*ana.xml
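
The four commands above can also be assembled programmatically. The following Python sketch is an illustration only, not part of the distributed scripts: the function name `jing_commands` is hypothetical, and it assumes the schemas sit in the current directory and the corpus directories follow the standard names used above.

```python
from pathlib import Path

JING_JAR = "/usr/share/java/jing.jar"  # jing location as given in the text above


def jing_commands(country: str, base: Path = Path(".")) -> list[list[str]]:
    """Build the jing invocations for one corpus; component files are globbed."""
    plain = base / f"ParlaMint-{country}"
    ana = base / f"ParlaMint-{country}.ana"
    jobs = [
        ("ParlaMint-teiCorpus.rng", [plain / f"ParlaMint-{country}.xml"]),
        ("ParlaMint-TEI.rng", sorted(plain.glob(f"ParlaMint-{country}_*.xml"))),
        ("ParlaMint-teiCorpus.ana.rng", [ana / f"ParlaMint-{country}.ana.xml"]),
        ("ParlaMint-TEI.ana.rng", sorted(ana.glob(f"ParlaMint-{country}_*ana.xml"))),
    ]
    # Skip a schema when no matching component files were found.
    return [
        ["java", "-jar", JING_JAR, schema, *map(str, files)]
        for schema, files in jobs
        if files
    ]
```

Each returned list could then be passed to `subprocess.run`, one schema at a time, so that a failure in one validation run does not hide the others.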
13 changes: 13 additions & 0 deletions Corpora/Docs/README-en.text.txt
@@ -0,0 +1,13 @@
Comparable parliamentary corpus
ParlaMint-XX ZZ
Derived plain-text corpus with TSV metadata

Citation, documentation, download, and licence available from
YY

This directory contains the machine translated ParlaMint-XX corpus as plain text,
which was automatically converted from the TEI encoded version of the corpus. The text
files have two TAB-separated columns, the first one giving the ID of the speech,
and the second the plain text of the speech. Transcriber comments are given in
double square brackets. Additionally, each text file has an associated TSV file
giving the metadata for each speech.
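
The two-column layout described above can be read with a few lines of Python. This is a minimal sketch, not a tool shipped with the corpus: the function names and the sample speech ID are hypothetical, and it assumes that transcriber comments never nest and that a line contains exactly one TAB between ID and text.

```python
import re

# Transcriber comments are given in double square brackets.
COMMENT = re.compile(r"\[\[.*?\]\]")


def parse_line(line: str) -> tuple[str, str]:
    """Split one line of a text file into (speech ID, speech text)."""
    speech_id, text = line.rstrip("\n").split("\t", 1)
    return speech_id, text


def strip_comments(text: str) -> str:
    """Remove [[...]] comments and collapse the whitespace left behind."""
    return " ".join(COMMENT.sub(" ", text).split())


# Hypothetical sample line, for illustration only.
sample = "ParlaMint-XX_2020-01-01.u1\tThank you. [[Applause.]] Next item.\n"
speech_id, text = parse_line(sample)
clean = strip_comments(text)
```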
18 changes: 18 additions & 0 deletions Corpora/Docs/README-en.vert.txt
@@ -0,0 +1,18 @@
Comparable parliamentary corpus
ParlaMint-XX ZZ
Version with derived vertically encoded corpus

Citation, documentation, download, and licence available from
YY

This directory contains the so-called vertical files (the format used by the CQP
and (no)Sketch Engine concordancers), which were automatically converted from the
linguistically encoded TEI.ana version of the machine translated ParlaMint-XX corpus.
Note that the vertical files do not contain all the information from the source TEI,
and that they use different element and attribute names from the TEI source.

Also included is the registry file, which is needed for noSketch Engine or
KonText (or, rather, their manatee back-end) to compile and mount the
corpus. The registry file has various values (such as paths to the data files)
that are specific to the noSketch Engine installation at CLARIN.SI, so they need
to be changed for any local installation of the corpus.
12 changes: 12 additions & 0 deletions Corpora/Docs/README.TEI.ana.txt
@@ -0,0 +1,12 @@
Comparable parliamentary corpus
ParlaMint-XX.ana ZZ
TEI linguistically annotated version

Citation, documentation, download, and licence available from
YY

This directory contains the ParlaMint/TEI linguistically annotated version of
the ParlaMint-XX corpus. The root file ParlaMint-XX.ana.xml contains the
corpus teiHeader and XIncludes of the component files, which have the form
ParlaMint-XX_<suffix>.ana.xml. Also included are the XML schemas of the
corpus.
10 changes: 10 additions & 0 deletions Corpora/Docs/README.TEI.txt
@@ -0,0 +1,10 @@
Comparable parliamentary corpus
ParlaMint-XX ZZ
TEI encoded version

Citation, documentation, download, and licence available from
YY

This directory contains the ParlaMint/TEI encoded version of the ParlaMint-XX
corpus. The root file ParlaMint-XX.xml contains the corpus teiHeader and
XIncludes of the component files, which have the form ParlaMint-XX_<suffix>.xml.
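
As an illustration of how the root file points to its components, the following Python sketch lists the XInclude targets of a root document. It is only a sketch: the function name `component_hrefs` and the file names in the sample are hypothetical, and a real root file carries a full teiHeader rather than the empty one shown here.

```python
import xml.etree.ElementTree as ET

# Qualified name of xi:include in the standard XInclude namespace.
XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def component_hrefs(root_xml: str) -> list[str]:
    """Return the href of every XInclude element in a corpus root document."""
    root = ET.fromstring(root_xml)
    return [el.get("href") for el in root.iter(XI_INCLUDE) if el.get("href")]


# Hypothetical, heavily shortened root file.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
           xmlns:xi="http://www.w3.org/2001/XInclude">
  <teiHeader/>
  <xi:include href="ParlaMint-XX_2020-01-01.xml"/>
  <xi:include href="ParlaMint-XX_2020-01-02.xml"/>
</teiCorpus>"""

hrefs = component_hrefs(SAMPLE)
```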
15 changes: 15 additions & 0 deletions Corpora/Docs/README.conll.txt
@@ -0,0 +1,15 @@
Comparable parliamentary corpus
ParlaMint-XX.ana ZZ
Derived CoNLL-U encoded corpus with TSV metadata

Citation, documentation, download, and licence available from
YY

This directory contains the CoNLL-U encoded ParlaMint-XX.ana corpus, which
was automatically converted from its linguistically encoded TEI.ana version. The
corpus also contains NER annotation in the IOB format. Additionally, each
CoNLL-U file has two associated TSV files giving the metadata of its speeches, one using
names in the corpus language and the other in English.

Note that the CoNLL-U files do not contain all the information from the source
corpus; in particular, the transcriber comments are not included.
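
A short Python sketch of how the IOB labels might be read from a CoNLL-U file follows. It rests on an assumption that should be checked against the actual files: that the label is stored in the tenth (MISC) column as a `NER=...` attribute. The function name `read_ner` and the sample sentence are hypothetical.

```python
def read_ner(conllu: str, key: str = "NER") -> list[tuple[str, str]]:
    """Return (form, IOB label) pairs; the MISC key is an assumption."""
    pairs = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        misc = dict(
            item.split("=", 1) for item in cols[9].split("|") if "=" in item
        )
        pairs.append((cols[1], misc.get(key, "O")))
    return pairs


# Hypothetical sample sentence in the ten-column CoNLL-U layout.
rows = [
    ["1", "Angela", "Angela", "PROPN", "_", "_", "3", "nsubj", "_", "NER=B-PER"],
    ["2", "Merkel", "Merkel", "PROPN", "_", "_", "1", "flat", "_", "NER=I-PER"],
    ["3", "spoke", "speak", "VERB", "_", "_", "0", "root", "_", "NER=O|SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "NER=O"],
]
sample = "# text = Angela Merkel spoke.\n" + "\n".join("\t".join(r) for r in rows)
labels = read_ner(sample)
```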
30 changes: 30 additions & 0 deletions Corpora/Docs/README.md/README-AT.md
@@ -0,0 +1,30 @@
# ParlaMint directory for samples of country AT (Austria)

- Languages: de (German)

## Documentation

### Characteristics of the national parliament

The Austrian Parliament is bicameral and consists of the following two chambers: the National Council (“Nationalrat”) and the Federal Council (“Bundesrat”). The political system is a multi-party system. The ParlaMint-AT corpus contains the shorthand records of the plenary sittings of the National Council from term 20 to term 27 (1996 - 2022).

### Data source and acquisition

The shorthand records have been freely available on the website of the Austrian Parliament (https://www.parlament.gv.at/PAKT/STPROT/) as HTML or PDF documents since the 20th legislative period. For earlier legislative periods, only scanned originals in PDF are available. As the data source for the ParlaMint-AT corpus, the HTML versions of the shorthand records were scraped from the Austrian parliamentary website. Note that the original HTML documents are encoded in Windows-1252, not UTF-8.
Metadata about legislative periods, governments and persons were also retrieved in HTML format from https://www.parlament.gv.at/PAKT/STPROT/ and subsequently transformed into XML-TEI using dedicated Perl scripts.
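
Because the source encoding differs from UTF-8, the downloaded bytes need to be decoded explicitly before further processing. The following Python sketch shows the idea; it is not the Perl code used for the corpus, and the function names are hypothetical.

```python
from pathlib import Path


def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Rewrite one downloaded HTML file as UTF-8 (paths are up to the caller)."""
    dst.write_bytes(to_utf8(src.read_bytes()))
```

Note that this only changes the byte encoding; any charset declaration inside the HTML itself would still have to be updated separately.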

### Data encoding process

The original HTML data were first cleaned of obvious formatting errors by applying string substitutions in Perl and then transformed to XHTML using tidy html 5. The data were then further transformed into TEI-XML using a series of Perl and XSLT scripts that had been created previously for the ParlAT corpus.

### Corpus-specific metadata

There is no metadata available going beyond what’s common for all corpora.

### Structure

There are no additional TEI elements beyond what’s described in the ParlaMint schema.

### Linguistic annotation

There is no specific linguistic annotation going beyond what’s common for all corpora.
41 changes: 41 additions & 0 deletions Corpora/Docs/README.md/README-BA.md
@@ -0,0 +1,41 @@
# ParlaMint directory for samples of country BA (Bosnia and Herzegovina)

- Languages: bs (Bosnian)


## Documentation

### Characteristics of the national parliament

The Parliamentary Assembly of Bosnia and Herzegovina is the legislative body of Bosnia and Herzegovina. It consists of two chambers: The House of Representatives (42 members) and The House of Peoples (15 members). The parliament is elected every four years. The corpus contains unauthorized (but officially published) transcripts of parliamentary sessions from both houses. It covers the period of 1998-2022.

### Data source and acquisition

Transcripts of parliamentary debates were collected from the official website of the Parliamentary Assembly of Bosnia and Herzegovina and cover the period from 1998 to 2022. Records were originally stored as machine-readable PDF files with a loose structure and fluid form over different terms (https://www.parlament.ba/session/Read?ConvernerId=2; https://www.parlament.ba/session/Read?ConvernerId=1). Each document was parsed and text-mined using regular expressions (RegEx) in order to construct a proto-dataset with a simple structure having just two entries: a speaker (most often first and last name) and a speech (a string of text capturing transcribed spoken word in Bosnian-Croatian-Serbian). It was then further populated with meta-information assigned to its parent file – House of Parliament, date, and session number. Finally, the names of MPs were linked with their party affiliation and biographic information collected from the official website of the parliament (https://www.parlament.ba/delegate/list; https://www.parlament.ba/representative/list). Missing entries were filled manually based on an extensive online search. As raw text exported from PDF files does not contain any formatting tags, additional information on agenda points had to be extracted using regular expressions and checked manually. Agenda points were then used for identification of moderators. This was done for all terms with several rounds of cleaning and parsing. The speeches from 1998-2018 were collected as a part of an ERC-funded project ELWar (https://zenodo.org/record/6521063).

### Data encoding process

The data were initially structured in four different parts:

- a table with transcriptions and their utterance IDs,
- a table with metadata on specific utterance IDs, including the ID of the speaker, date, term, house, speaker role and party,
- a table linking speaker ID with their personal data (e.g., their date and place of birth, education, party)
- a table describing parties, their abbreviation, full names, chairs, their coalition composition in specific terms, coalition vs. opposition statuses, and more.

The first two resources were used during the construction of the component TEI documents, while the last two were encoded in the root TEI. The data were checked for inconsistencies and imputed as best as possible using government sources (e.g. parlament.ba) and independent projects (e.g. javnarasprava.ba).

The data were read and cleaned using the Python pandas library, after which a component XML template was prepared. Day-level grouped data were packaged into a TEI-compatible format using the xmltree Python library, and inserted into the template. The root TEI document was prepared in a similar way, with the goal of encoding members of the parliament and the parties present in the data.
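
The day-level packaging step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it uses only the Python standard library rather than pandas, the row keys (`id`, `date`, `speaker_id`, `text`) and the function name are hypothetical, and the resulting elements are far simpler than a valid ParlaMint component file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialised as xml:id


def build_day_documents(rows: list[dict]) -> dict[str, ET.Element]:
    """Group utterance rows by date and wrap each group in a TEI-like tree."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row["date"]].append(row)

    docs = {}
    for day, utterances in sorted(by_day.items()):
        tei = ET.Element("TEI")
        body = ET.SubElement(ET.SubElement(tei, "text"), "body")
        for row in utterances:
            u = ET.SubElement(body, "u", who="#" + row["speaker_id"])
            u.set(XML_ID, row["id"])
            ET.SubElement(u, "seg").text = row["text"]
        docs[day] = tei
    return docs


# Hypothetical rows standing in for the joined transcription and metadata tables.
rows = [
    {"id": "u1", "date": "2020-01-01", "speaker_id": "spk1", "text": "First."},
    {"id": "u2", "date": "2020-01-01", "speaker_id": "spk2", "text": "Second."},
    {"id": "u3", "date": "2020-01-02", "speaker_id": "spk1", "text": "Third."},
]
docs = build_day_documents(rows)
```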

Finally, a regex + xmltree pipeline was run over the data to detect transcriber comments in the transcripts and to encode them in the TEI format as different types of notes, interruptions, gaps, or applause.

### Corpus-specific metadata

There are no metadata available beyond what is common for all corpora.

### Structure

There are no additional TEI elements beyond what is described in the ParlaMint schema.

### Linguistic annotation

For annotating the Bosnian corpus, the standard language models for Croatian of the CLASSLA-Stanza pipeline (https://pypi.org/project/classla/) were used. On the level of morphosyntactic annotation for this corpus MULTEXT-East annotations (http://nl.ijs.si/ME/V6/msd/html/msd-hbs.html) are made available as well.
62 changes: 62 additions & 0 deletions Corpora/Docs/README.md/README-BE.md
@@ -0,0 +1,62 @@
# ParlaMint directory for samples of country BE (Belgium)

- Languages: fr (French), nl (Dutch)


## Documentation

### Characteristics of the national parliament

The Belgian Federal Parliament is the bicameral parliament of Belgium. It consists of the Chamber of Representatives (https://www.dekamer.be/) and the Senate (https://www.senate.be).

The current corpus consists of transcripts of the plenary sessions and the committee meetings of the Chamber of Representatives.

The plenary assembly is the assembly of 150 directly elected representatives of the people.

Its main tasks (https://www.dekamer.be/kvvcr/showpage.cfm?section=/pri/competence&language=nl&story=competence.xml) are to monitor government policy and public finance and to control legislation; together with the Senate, the Chamber is responsible for the Constitution and legislation concerning the organisation of the State. For all other legislation, the Chamber alone is competent.

The committees prepare the work of the plenary, which allows it to work more efficiently and quickly. Draft laws and proposals (bills, motions for resolutions, proposals to set up a committee of enquiry, proposals to revise the Constitution) are presented, discussed, possibly amended and voted on. The report of the discussion and the text adopted by the committee are then submitted to the plenary. Besides preparing the legislative work, the committees also exercise control over the government through interpellations and oral questions.

### Data source and acquisition

The source data were obtained by scraping from the parliamentary website (https://www.dekamer.be/). It consists of HTML apparently exported from Microsoft Word.

Further details can be found in the corpus headers and in the table below:

| Period | 2015-2020 |
| :---- | :---- |
| Size | 356 plenary sessions, 1335 committee meetings, 148425 speeches, 32563557 tokens |
| Language | Mainly mixed French and Dutch (55% French, 45% Dutch, measured in annotated tokens). Several hundred German utterances. |
| Source format | HTML apparently exported from Microsoft Word |
| Data harvesting | Scraping from the parliamentary website (https://www.dekamer.be/) |
| Availability | Public domain; Available from CLARIN website as part and INT Language resource repository. Handles: http://hdl.handle.net/11356/1388 for the unannotated corpus, http://hdl.handle.net/11356/1405 for the linguistically annotated corpus. |


### Data encoding process

The conversion consists of several steps to transform and enrich the HTML source.

- The first step was to transform the HTML to XML, omitting irrelevant HTML tags and keeping the meaningful elements.
- The second step consisted of a set of regex-based search and replace actions on the XML to prepare the transformation to TEI with two XSLT stylesheets.
- In the last step we added language detection with a Python script, as we found that it did a better job than the original MS Word language recognition in some cases.

The main challenges were related to the unstructured nature of the source data. We had to deal with many inconsistencies in the use of HTML elements, classes and styles. It was a challenging task to recognize the beginning and ending of the speeches and to separate them into monolingual segments.
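
Once each sentence has a language label, splitting a speech into monolingual segments amounts to merging consecutive sentences that share a label. The Python sketch below illustrates only that merging step; the function name is hypothetical, and it assumes the labels have already been produced by some language identifier.

```python
from itertools import groupby


def monolingual_segments(
    sentences: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Merge consecutive (lang, text) sentences with the same language."""
    return [
        (lang, " ".join(text for _, text in group))
        for lang, group in groupby(sentences, key=lambda pair: pair[0])
    ]


# Hypothetical speech that switches from Dutch to French and back.
speech = [
    ("nl", "Dank u."),
    ("nl", "Ik geef het woord."),
    ("fr", "Merci."),
    ("nl", "Goed."),
]
segments = monolingual_segments(speech)
```

Because `groupby` only merges adjacent items, a speaker who returns to an earlier language starts a new segment, which preserves the original order of the speech.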

### Structure

The dependency parser sometimes trips over long sentences (200 tokens or more, mostly enumerations). They are annotated as follows:
```XML
<gap reason="editorial">
<desc>Sentence could not be parsed: [sentence]</desc>
</gap>
```
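
A placeholder of this shape could be generated as in the following Python sketch. It is an illustration only: the function name is hypothetical, and the element is built without the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET


def gap_for_unparsed(sentence: str) -> ET.Element:
    """Build the <gap> placeholder for a sentence the parser could not handle."""
    gap = ET.Element("gap", reason="editorial")
    desc = ET.SubElement(gap, "desc")
    desc.text = f"Sentence could not be parsed: {sentence}"
    return gap


# Hypothetical short sentence standing in for a long enumeration.
serialised = ET.tostring(gap_for_unparsed("A, B, C."), encoding="unicode")
```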

### Linguistic annotation

The linguistic processing involves universal dependencies PoS and dependency relations, lemma, and four-class (PER, LOC, ORG, MISC) named entity recognition. The process for the BE corpus consists of:

- Language identification, combining the Microsoft Office language identification present in the source documents with the Python language identification module langdetect (https://pypi.org/project/langdetect/).
- Tokenization (Dutch and French) and Tagging/Lemmatizing (Dutch only) by means of an INT in-house tagger based on Support Vector Machines, which supports TEI input and output.
- Dependency parsing and NER, using the trankit (https://github.com/nlp-uoregon/trankit) universal dependencies pipeline.
- Post-processing to conform to the strict ParlaMint schema, to generate the corpus header from the metadata database and the component files, and to remove incorrectly identified named entities in the first position of sentences for French.