Merge pull request #810 from clarin-eric/main

data-main
clarin-eric · Oct 11, 2023 · 1695edb
2 parents d38275c + ff8e532
commit 1695edb

Showing 82 changed files with 7,749 additions and 628 deletions.
diff --git a/Corpora/Docs/Makefile b/Corpora/Docs/Makefile
@@ -0,0 +1,23 @@
+## Copy and rename README and registry files so they can be included in a distribution
+
+cp_all:	cp_readmes cp_regis
+
+README_CORPORA = AT BA BE BG CZ DK EE ES ES-CT ES-GA ES-PV FI FR GB GR HR HU IS IT LV NL NO PL PT RS SE SI TR UA
+REGIST_VERSION = 40
+REGIST_CORPORA = at ba be bg cz dk ee es es_ct es_ga es_pv fi fr gb gr hr hu is it lv nl no pl pt rs se si tr ua
+
+# Main READMEs from ../../Samples/, copied here and renamed
+cp_readmes:
+	rm -f README.md/*.md
+	for CORPUS in ${README_CORPORA}; do \
+	cp ../../Samples/ParlaMint-$${CORPUS}/README.md \
+	README.md/README-$${CORPUS}.md; \
+	done;
+
+# Registry files 
+cp_regis:
+	rm -f registry/*
+	for CORPUS in ${REGIST_CORPORA}; do \
+	cp /project/clarinsi-cqp/registry/parlamint${REGIST_VERSION}_$${CORPUS} \
+	registry/parlamint${REGIST_VERSION}_$${CORPUS}; \
+	done;
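The two corpus lists in this Makefile appear to differ only in case and in `-` versus `_`. The following is a minimal Python sketch, not part of the PR, that derives the registry codes from the README codes and rebuilds the copy targets. The helper names (`readme_target`, `registry_code`, `registry_target`) are hypothetical, and the lowercase/underscore rule is an assumption inferred from the two lists.

```python
# Values copied from the Makefile above.
README_CORPORA = (
    "AT BA BE BG CZ DK EE ES ES-CT ES-GA ES-PV FI FR GB GR HR HU IS IT "
    "LV NL NO PL PT RS SE SI TR UA"
).split()
REGIST_CORPORA = (
    "at ba be bg cz dk ee es es_ct es_ga es_pv fi fr gb gr hr hu is it "
    "lv nl no pl pt rs se si tr ua"
).split()
REGIST_VERSION = 40


def readme_target(code):
    """Destination path that cp_readmes writes for one README code."""
    return f"README.md/README-{code}.md"


def registry_code(code):
    """Assumed rule: lowercase the README code and replace '-' with '_'."""
    return code.lower().replace("-", "_")


def registry_target(code, version=REGIST_VERSION):
    """Destination path that cp_regis writes for one README code."""
    return f"registry/parlamint{version}_{registry_code(code)}"


# README codes whose derived registry code is missing from REGIST_CORPORA.
mismatches = [c for c in README_CORPORA if registry_code(c) not in REGIST_CORPORA]
```

If `mismatches` is empty, the second list could be generated from the first instead of being maintained by hand; whether that is desirable for this repository is a judgment call for the maintainers.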
diff --git a/Corpora/Docs/README-en.TEI.ana.txt b/Corpora/Docs/README-en.TEI.ana.txt
@@ -0,0 +1,12 @@
+                        Comparable parliamentary corpus
+                              ParlaMint-XX.ana ZZ
+                      TEI linguistically annotated version
+
+         Citation, documentation, download, and licence available from
+                        YY
+
+This directory contains the ParlaMint/TEI linguistically annotated version of
+the machine translated ParlaMint-XX.ana corpus. The root file ParlaMint-XX.ana.xml
+contains the corpus teiHeader and XIncludes of the component files, which have the form
+ParlaMint-XX_<suffix>.ana.xml. Also included are the XML schemas of the
+corpus.
diff --git a/Corpora/Docs/README-en.TEI.txt b/Corpora/Docs/README-en.TEI.txt
@@ -0,0 +1,11 @@
+                        Comparable parliamentary corpus
+                                ParlaMint-XX ZZ
+                              TEI encoded version
+
+         Citation, documentation, download, and licence available from
+                        YY
+
+This directory contains the ParlaMint/TEI encoded version of the machine
+translated ParlaMint-XX corpus. The root file ParlaMint-XX.xml contains
+the corpus teiHeader and XIncludes of the component files, which have the
+form ParlaMint-XX_<suffix>.xml.
diff --git a/Corpora/Docs/README-en.conll.txt b/Corpora/Docs/README-en.conll.txt
@@ -0,0 +1,14 @@
+                        Comparable parliamentary corpus
+                              ParlaMint-XX.ana ZZ
+                Derived CoNLL-U encoded corpus with TSV metadata
+
+         Citation, documentation, download, and licence available from
+                        YY
+
+This directory contains the CoNLL-U encoded machine translated ParlaMint-XX.ana corpus,
+which was automatically converted from its linguistically encoded TEI.ana version. The
+corpus also contains NER annotation in the IOB format. Additionally, each
+CoNLL-U file has an associated TSV file giving the metadata of its speeches.
+
+Note that the CoNLL-U files do not contain all the information from the source
+corpus; in particular, the transcriber comments are not included.
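The README mentions NER annotation in the IOB format. As an illustration only, the Python sketch below groups a sequence of IOB tags into entity spans. The README does not say where the tags are stored in the CoNLL-U files, so the function takes a plain list of tags; the name `iob_spans` and the sample tags are hypothetical.

```python
def iob_spans(tags):
    """Group IOB tags into (start, end, label) spans, with end exclusive.

    A stray I- tag without a matching preceding tag is treated leniently
    as the start of a new entity.
    """
    spans = []
    start = label = None
    for i, tag in enumerate(tags):
        # An I- tag with the same label continues the open entity.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the entity that is currently open.
        if label is not None:
            spans.append((start, i, label))
            start = label = None
        # B- (or a stray I-) opens a new entity; "O" opens nothing.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


example = iob_spans(["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"])
```

For the sample above, `example` is `[(0, 2, "PER"), (3, 4, "LOC"), (4, 6, "ORG")]`.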
diff --git a/Corpora/Docs/README-en.schema.txt b/Corpora/Docs/README-en.schema.txt
@@ -0,0 +1,55 @@
+                        Comparable parliamentary corpora
+                               ParlaMint ZZ
+      Parla-CLARIN and ParlaMint schemas with conversion scripts
+
+         Citation, documentation, download, and licence available from
+                       YY
+
+This directory contains:
+
+1. The Parla-CLARIN schema and documentation, archived from
+   https://github.com/clarin-eric/parla-clarin. This schema was used
+   as the overall frame in which the ParlaMint corpora were encoded,
+   and is, in this context, useful mostly for its documentation that
+   can be found in the Parla-CLARIN/docs directory.
+
+2. The ParlaMint schemas, which were made just for ParlaMint corpora
+   and attempt to constrain the encoding as closely as possible to
+   the one used by ParlaMint. More about these below.
+
+3. Some XSLT scripts in directory bin/ that can be used to convert
+   ParlaMint TEI encoded files into other formats, such as plain text,
+   CoNLL-U and vertical files.
+
+ParlaMint schemas
+
+The ParlaMint schemas are available in RelaxNG XML format (.rng),
+RelaxNG compact format (.rnc) and in the W3C Schema language
+(.xsd). As ParlaMint corpora are too large to be validated as one
+document, and, furthermore, exist in two versions (the "plain text"
+one, and the linguistically annotated one) there are four schemas for
+validation:
+
+- ParlaMint-TEI, used to validate component (TEI rooted) files of a
+  ParlaMint corpus. This schema also contains most of the definitions
+  that are imported by the other schemas.
+
+- ParlaMint-teiCorpus, used to validate the top-level (teiCorpus
+  rooted) file of a ParlaMint corpus. The file should contain
+  XIncludes of the corpus components.
+
+- ParlaMint-TEI.ana, used to validate component (TEI rooted) files of
+  the linguistically annotated version of a ParlaMint corpus.
+
+- ParlaMint-teiCorpus.ana, used to validate the top-level (teiCorpus
+  rooted) file of the linguistically annotated version of a ParlaMint
+  corpus.
+
+For validating the ParlaMint corpus of country XX using standard
+ParlaMint names for directories and files, a validation run under Unix
+using jing installed at /usr/share/java/ would be:
+
+$ java -jar /usr/share/java/jing.jar ParlaMint-teiCorpus.rng     ParlaMint-XX/ParlaMint-XX.xml
+$ java -jar /usr/share/java/jing.jar ParlaMint-TEI.rng           ParlaMint-XX/ParlaMint-XX_*.xml
+$ java -jar /usr/share/java/jing.jar ParlaMint-teiCorpus.ana.rng ParlaMint-XX.ana/ParlaMint-XX.ana.xml
+$ java -jar /usr/share/java/jing.jar ParlaMint-TEI.ana.rng       ParlaMint-XX.ana/ParlaMint-XX_*ana.xml
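The four jing calls above follow one pattern per schema. The Python sketch below, which is not part of the PR, builds the same schema/file pairings for a given country code and expands the file patterns itself, since a command run without a shell does not expand `*`. The function names are hypothetical, and the jar path is the one assumed in the README.

```python
import glob

JING_JAR = "/usr/share/java/jing.jar"  # path assumed in the README; adjust locally


def validation_jobs(country):
    """Return (schema, file pattern) pairs mirroring the four commands above."""
    plain = f"ParlaMint-{country}"
    ana = f"{plain}.ana"
    return [
        ("ParlaMint-teiCorpus.rng", f"{plain}/{plain}.xml"),
        ("ParlaMint-TEI.rng", f"{plain}/{plain}_*.xml"),
        ("ParlaMint-teiCorpus.ana.rng", f"{ana}/{ana}.xml"),
        ("ParlaMint-TEI.ana.rng", f"{ana}/{plain}_*ana.xml"),
    ]


def jing_command(schema, pattern):
    """Build the argument list for one jing run, or None if no file matches."""
    files = sorted(glob.glob(pattern))
    if not files:
        return None
    return ["java", "-jar", JING_JAR, schema, *files]
```

Each non-`None` result can be passed to `subprocess.run`; skipping empty matches avoids calling jing with a schema and no documents.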
diff --git a/Corpora/Docs/README-en.text.txt b/Corpora/Docs/README-en.text.txt
@@ -0,0 +1,13 @@
+                        Comparable parliamentary corpus
+                                ParlaMint-XX ZZ
+                  Derived plain-text corpus with TSV metadata
+
+         Citation, documentation, download, and licence available from
+                        YY
+
+This directory contains the machine translated ParlaMint-XX corpus as plain text,
+which was automatically converted from the TEI encoded version of the corpus. The text
+files have two TAB-separated columns, the first one giving the ID of the speech,
+and the second the plain text of the speech. Transcriber comments are given in
+double square brackets.  Additionally, each text file has an associated TSV file
+giving the metadata for each speech.
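As an illustration of the layout described above (speech ID, a TAB, then the speech text, with transcriber comments in double square brackets), here is a minimal Python sketch. It is not part of the PR; the function names and the sample ID `u1` are hypothetical, and it assumes comments never nest.

```python
import re

# Transcriber comments are given in double square brackets, e.g. [[Applause]].
COMMENT = re.compile(r"\[\[.*?\]\]")


def parse_line(line):
    """Split one line of a text file into (speech ID, speech text)."""
    speech_id, text = line.rstrip("\n").split("\t", 1)
    return speech_id, text


def strip_comments(text):
    """Remove transcriber comments and collapse the whitespace left behind."""
    return " ".join(COMMENT.sub(" ", text).split())


speech_id, text = parse_line("u1\tThank you. [[Applause]] Next item.\n")
clean = strip_comments(text)
```

Here `clean` is `"Thank you. Next item."`, while `text` still carries the comment for users who want to keep it.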
diff --git a/Corpora/Docs/README-en.vert.txt b/Corpora/Docs/README-en.vert.txt
@@ -0,0 +1,18 @@
+                        Comparable parliamentary corpus
+                               ParlaMint-XX ZZ
+                Version with derived vertically encoded corpus
+
+         Citation, documentation, download, and licence available from
+                       YY
+
+This directory contains the so-called vertical files (the format used by the CQP
+and (no)Sketch Engine concordancers), which were automatically converted from the
+linguistically encoded TEI.ana version of the machine translated ParlaMint-XX corpus.
+Note that the vertical files do not contain all the information from the source TEI,
+and that they use different element and attribute names from the TEI source.
+
+Also included is the registry file, which is needed for noSketch Engine or
+KonText (or, rather, their manatee back-end) to compile and mount the
+corpus. The registry file has various values (such as paths to the data files)
+that are specific to the noSketch Engine installation at CLARIN.SI, so they need
+to be changed for any local installation of the corpus.
diff --git a/Corpora/Docs/README.TEI.ana.txt b/Corpora/Docs/README.TEI.ana.txt
@@ -0,0 +1,12 @@
+                        Comparable parliamentary corpus
+                              ParlaMint-XX.ana ZZ
+                      TEI linguistically annotated version
+
+         Citation, documentation, download, and licence available from
+                        YY
+
+This directory contains the ParlaMint/TEI linguistically annotated version of
+the ParlaMint-XX corpus. The root file ParlaMint-XX.ana.xml contains the
+corpus teiHeader and XIncludes of the component files, which have the form
+ParlaMint-XX_<suffix>.ana.xml. Also included are the XML schemas of the
+corpus.
diff --git a/Corpora/Docs/README.TEI.txt b/Corpora/Docs/README.TEI.txt
@@ -0,0 +1,10 @@
+                        Comparable parliamentary corpus
+                                ParlaMint-XX ZZ
+                              TEI encoded version
+
+         Citation, documentation, download, and licence available from
+                        YY
+
+This directory contains the ParlaMint/TEI encoded version of the ParlaMint-XX
+corpus. The root file ParlaMint-XX.xml contains the corpus teiHeader and
+XIncludes of the component files, which have the form ParlaMint-XX_<suffix>.xml.
diff --git a/Corpora/Docs/README.conll.txt b/Corpora/Docs/README.conll.txt
@@ -0,0 +1,15 @@
+                        Comparable parliamentary corpus
+                              ParlaMint-XX.ana ZZ
+                Derived CoNLL-U encoded corpus with TSV metadata
+
+         Citation, documentation, download, and licence available from
+                        YY
+
+This directory contains the CoNLL-U encoded ParlaMint-XX.ana corpus, which
+was automatically converted from its linguistically encoded TEI.ana version. The
+corpus also contains NER annotation in the IOB format. Additionally, each
+CoNLL-U file has two associated TSV files giving the metadata of its speeches, one using
+names in the corpus language and the other in English.
+
+Note that the CoNLL-U files do not contain all the information from the source
+corpus; in particular, the transcriber comments are not included.
diff --git a/Corpora/Docs/README.md/README-AT.md b/Corpora/Docs/README.md/README-AT.md
@@ -0,0 +1,30 @@
+# ParlaMint directory for samples of country AT (Austria)
+
+- Languages: de (German)
+
+## Documentation
+
+### Characteristics of the national parliament
+
+The Austrian Parliament is bicameral and consists of two chambers: the National Council (“Nationalrat”) and the Federal Council (“Bundesrat”). The political system is a multi-party system. The ParlaMint-AT corpus contains the shorthand records of the plenary sittings of the National Council from term 20 to term 27 (1996-2022).
+
+### Data source and acquisition
+
+The shorthand records have been freely available on the website of the Austrian Parliament (https://www.parlament.gv.at/PAKT/STPROT/) as HTML or PDF documents since the 20th legislative period. For the earlier legislative periods, only scanned originals in PDF are available. As the data source for the ParlaMint-AT corpus, the HTML versions of the shorthand records were scraped from the Austrian parliamentary website. Note that the original HTML documents are encoded in Windows-1252 and not in UTF-8.
+Metadata about legislative periods, governments and persons was also retrieved in HTML format from https://www.parlament.gv.at/PAKT/STPROT/ and subsequently transformed into XML-TEI using dedicated Perl scripts.
+
+### Data encoding process
+
+The original HTML data was first cleaned of obvious formatting errors by applying string substitutions in Perl and then transformed to XHTML using tidy html 5. The data was then further transformed into TEI-XML using a series of Perl and XSLT scripts that had been created previously for the ParlAT corpus.
+
+### Corpus-specific metadata
+
+There is no metadata available going beyond what’s common for all corpora.
+
+### Structure
+
+There are no additional TEI elements beyond what’s described in the ParlaMint schema.
+
+### Linguistic annotation
+
+There is no specific linguistic annotation going beyond what’s common for all corpora.
diff --git a/Corpora/Docs/README.md/README-BA.md b/Corpora/Docs/README.md/README-BA.md
@@ -0,0 +1,41 @@
+# ParlaMint directory for samples of country BA (Bosnia and Herzegovina)
+
+- Languages: bs (Bosnian)
+
+
+## Documentation
+
+### Characteristics of the national parliament
+
+The Parliamentary Assembly of Bosnia and Herzegovina is the legislative body of Bosnia and Herzegovina. It consists of two chambers: The House of Representatives (42 members) and The House of Peoples (15 members). The parliament is elected every four years. The corpus contains unauthorized (but officially published) transcripts of parliamentary sessions from both houses. It covers the period of 1998-2022.
+
+### Data source and acquisition
+
+Transcripts of parliamentary debates were collected from the official website of the Parliamentary Assembly of Bosnia and Herzegovina and cover the period from 1998 to 2022. Records were originally stored as machine-readable PDF files with a loose structure and fluid form over different terms (https://www.parlament.ba/session/Read?ConvernerId=2; https://www.parlament.ba/session/Read?ConvernerId=1). Each document was parsed and text-mined using regular expressions (RegEx) in order to construct a proto-dataset with a simple structure having just two entries: a speaker (most often first and last name) and a speech (a string of text capturing transcribed spoken word in Bosnian-Croatian-Serbian). It was then further populated with meta-information assigned to its parent file – House of Parliament, date, and session number. Finally, the names of MPs were linked with their party affiliation and biographic information collected from the official website of the parliament (https://www.parlament.ba/delegate/list; https://www.parlament.ba/representative/list). Missing entries were filled manually based on an extensive online search. As raw text exported from PDF files does not contain any formatting tags, additional information on agenda points had to be extracted using regular expressions and checked manually. Agenda points were then used for identification of moderators. This was done for all terms with several rounds of cleaning and parsing. The speeches from 1998-2018 were collected as a part of an ERC-funded project ELWar (https://zenodo.org/record/6521063).
+
+### Data encoding process
+
+The data were initially structured in four different parts:
+
+- a table with transcriptions and their utterance IDs,
+- a table with metadata on specific utterance IDs, including the ID of the speaker, date, term, house, speaker role and party,
+- a table linking speaker IDs with their personal data (e.g., their date and place of birth, education, party),
+- a table describing parties, their abbreviation, full names, chairs, their coalition composition in specific terms, coalition vs. opposition statuses, and more.
+
+The first two resources were used during the construction of the component TEI documents, while the last two were encoded in the root TEI. The data were checked for inconsistencies and imputed as best as possible using government sources (e.g. parlament.ba) and independent projects (e.g. javnarasprava.ba).
+
+The data were read and cleaned using the Python pandas library, after which a component XML template was prepared. Day-level grouped data were packaged into a TEI-compatible format using the xmltree Python library and inserted into the template. The root TEI document was prepared in a similar way, with the goal of encoding members of the parliament and the parties present in the data.
+
+Finally, a regex + xmltree pipeline was run over the data to detect transcriber comments in the transcripts and to encode them in the TEI format as different types of notes, interruptions, gaps, or applause.
+
+### Corpus-specific metadata
+
+There are no metadata available beyond what is common for all corpora.
+
+### Structure
+
+There are no additional TEI elements beyond what is described in the ParlaMint schema.
+
+### Linguistic annotation
+
+For annotating the Bosnian corpus, the standard language models for Croatian of the CLASSLA-Stanza pipeline (https://pypi.org/project/classla/) were used. On the level of morphosyntactic annotation for this corpus MULTEXT-East annotations (http://nl.ijs.si/ME/V6/msd/html/msd-hbs.html) are made available as well.
diff --git a/Corpora/Docs/README.md/README-BE.md b/Corpora/Docs/README.md/README-BE.md
@@ -0,0 +1,62 @@
+# ParlaMint directory for samples of country BE (Belgium)
+
+- Languages: fr (French), nl (Dutch)
+
+
+## Documentation
+
+### Characteristics of the national parliament
+
+The Belgian Federal Parliament is the bicameral parliament of Belgium. It consists of the Chamber of Representatives (https://www.dekamer.be/) and the Senate (https://www.senate.be).
+
+The current corpus consists of transcripts of the plenary sessions and the committee meetings of the Chamber of Representatives.
+
+The plenary assembly is the assembly of 150 directly elected representatives of the people.
+
+Its main tasks (https://www.dekamer.be/kvvcr/showpage.cfm?section=/pri/competence&language=nl&story=competence.xml) are to monitor government policy and public finance and to control legislation; together with the Senate, the Chamber is responsible for the Constitution and legislation concerning the organisation of the State. For all other legislation, the Chamber alone is competent.
+
+The committees prepare the work of the plenary, which allows it to work more efficiently and quickly. Draft laws and proposals (bills, motions for resolutions, proposals to set up a committee of enquiry, proposals to revise the Constitution) are presented, discussed, possibly amended and voted on. The report of the discussion and the text adopted by the committee are then submitted to the plenary. Besides preparing the legislative work, the committees also exercise control over the government through interpellations and oral questions.
+
+### Data source and acquisition
+
+The source data were obtained by scraping the parliamentary website (https://www.dekamer.be/). They consist of HTML apparently exported from Microsoft Word.
+
+Further details can be found in the corpus headers and in the table below:
+
+| Period | 2015-2020 |
+| :---- | :---- |
+| Size | 356 plenary sessions, 1335 committee meetings, 148425 speeches, 32563557 tokens |
+| Language | Mainly mixed French and Dutch (55% French, 45% Dutch, measured in annotated tokens). Several hundred German utterances. |
+| Source format | HTML apparently exported from Microsoft Word |
+| Data harvesting | Scraping from the parliamentary website (https://www.dekamer.be/) |
+| Availability | Public domain; available from the CLARIN website and the INT language resource repository. |
+| Handles | http://hdl.handle.net/11356/1388 for the unannotated corpus, http://hdl.handle.net/11356/1405 for the linguistically annotated corpus. |
+
+
+### Data encoding process
+
+The conversion consists of several steps that transform and enrich the HTML source.
+
+- The first step was to transform the HTML to XML, omitting irrelevant HTML tags and keeping the meaningful elements.
+- The second step consisted of a set of regex-based search-and-replace actions on the XML to prepare the transformation to TEI with two XSLT stylesheets.
+- In the last step, language detection was added with a Python script, as it did a better job in some cases than the original MS Word language recognition.
+
+The main challenges were related to the unstructured nature of the source data. We had to deal with many inconsistencies in the use of HTML elements, classes and styles. It was a challenging task to recognize the beginning and end of the speeches and to separate them into monolingual segments.
+
+### Structure
+
+The dependency parser sometimes trips over long sentences (200 tokens or more, mostly enumerations). They are annotated as follows:
+```XML
+<gap reason="editorial">
+  <desc>Sentence could not be parsed: [sentence]</desc>
+</gap>
+```
+
+### Linguistic annotation
+
+The linguistic processing involves universal dependencies PoS and dependency relations, lemma, and four-class (PER, LOC, ORG, MISC) named entity recognition. The process for the BE corpus consists of:
+
+- Language identification, combining the Microsoft Office language identification present in the source documents with the Python language identification module langdetect (https://pypi.org/project/langdetect/).
+- Tokenization (Dutch and French) and Tagging/Lemmatizing (Dutch only) by means of an INT in-house tagger based on Support Vector Machines, which supports TEI input and output.
+- Dependency parsing and NER, using the trankit (https://github.com/nlp-uoregon/trankit) universal dependencies pipeline.
+- Post-processing to conform to the strict ParlaMint schema, to generate the corpus header from the metadata database and the component files, and to remove incorrectly identified named entities in the first position of sentences for French.