diff --git a/Corpora/Docs/Makefile b/Corpora/Docs/Makefile new file mode 100644 index 000000000..c781f10dd --- /dev/null +++ b/Corpora/Docs/Makefile @@ -0,0 +1,23 @@ +## Copy and rename README and registry files so they can be included in a distribution + +cp_all: cp_readmes cp_regis + +README_CORPORA = AT BA BE BG CZ DK EE ES ES-CT ES-GA ES-PV FI FR GB GR HR HU IS IT LV NL NO PL PT RS SE SI TR UA +REGIST_VERSION = 40 +REGIST_CORPORA = at ba be bg cz dk ee es es_ct es_ga es_pv fi fr gb gr hr hu is it lv nl no pl pt rs se si tr ua + +# Main readmes from Data/, store here and rename +cp_readmes: + rm -f README.md/*.md + for CORPUS in ${README_CORPORA}; do \ + cp ../../Samples/ParlaMint-$${CORPUS}/README.md \ + README.md/README-$${CORPUS}.md; \ + done; + +# Registry files +cp_regis: + rm -f registry/* + for CORPUS in ${REGIST_CORPORA}; do \ + cp /project/clarinsi-cqp/registry/parlamint${REGIST_VERSION}_$${CORPUS} \ + registry/parlamint${REGIST_VERSION}_$${CORPUS}; \ + done; diff --git a/Corpora/Docs/README-en.TEI.ana.txt b/Corpora/Docs/README-en.TEI.ana.txt new file mode 100644 index 000000000..05b438153 --- /dev/null +++ b/Corpora/Docs/README-en.TEI.ana.txt @@ -0,0 +1,12 @@ + Comparable parliamentary corpus + ParlaMint-XX.ana ZZ + TEI linguistically annotated version + + Citation, documentation, download, and licence available from + YY + +This directory contains the ParlaMint/TEI linguistically annotated version of +the machine translated ParlaMint-XX.ana corpus. The root file ParlaMint-XX.ana.xml +contains the corpus teiHeader and XIncludes of the component files, which have the form +ParlaMint-XX_.ana.xml. Also included are the XML schemas of the +corpus. 
diff --git a/Corpora/Docs/README-en.TEI.txt b/Corpora/Docs/README-en.TEI.txt new file mode 100644 index 000000000..ac0c9d502 --- /dev/null +++ b/Corpora/Docs/README-en.TEI.txt @@ -0,0 +1,11 @@ + Comparable parliamentary corpus + ParlaMint-XX ZZ + TEI encoded version + + Citation, documentation, download, and licence available from + YY + +This directory contains the ParlaMint/TEI encoded version of the machine +translated ParlaMint-XX corpus. The root file ParlaMint-XX.xml contains +the corpus teiHeader and XIncludes of the component files, which have the +form ParlaMint-XX_.xml. diff --git a/Corpora/Docs/README-en.conll.txt b/Corpora/Docs/README-en.conll.txt new file mode 100644 index 000000000..0006b365a --- /dev/null +++ b/Corpora/Docs/README-en.conll.txt @@ -0,0 +1,14 @@ + Comparable parliamentary corpus + ParlaMint-XX.ana ZZ + Derived CoNLL-U encoded corpus with TSV metadata + + Citation, documentation, download, and licence available from + YY + +This directory contains the CoNLL-U encoded machine translated ParlaMint-XX.ana corpus, +which was automatically converted from its linguistically encoded TEI.ana version. The +corpus also contains NER annotation in the IOB format. Additionally, each +CoNLL-U file has an associated TSV file giving the metadata of its speeches. + +Note that the CoNLL-U files do not contain all the information from the source +corpus; in particular, the transcriber comments are not included. diff --git a/Corpora/Docs/README-en.schema.txt b/Corpora/Docs/README-en.schema.txt new file mode 100644 index 000000000..852ebeead --- /dev/null +++ b/Corpora/Docs/README-en.schema.txt @@ -0,0 +1,55 @@ + Comparable parliamentary corpora + ParlaMint ZZ + Parla-CLARIN and ParlaMint schemas with conversion scripts + + Citation, documentation, download, and licence available from + YY + +This directory contains: + +1. The Parla-CLARIN schema and documentation, archived from + https://github.com/clarin-eric/parla-clarin. 
This schema was used + as the overall frame in which the ParlaMint corpora were encoded, + and is, in this context, useful mostly for its documentation that + can be found in the Parla-CLARIN/docs directory. + +2. The ParlaMint schemas, which were made just for ParlaMint corpora + and attempt to constrain the encoding as tightly as possible to + what is actually used in ParlaMint. More about these below. + +3. Some XSLT scripts in directory bin/ that can be used to convert + ParlaMint TEI encoded files into other formats, such as plain text, + CoNLL-U and vertical files. + +ParlaMint schemas + +The ParlaMint schemas are available in RelaxNG XML format (.rng), +RelaxNG compact format (.rnc) and in the W3C Schema language +(.xsd). As ParlaMint corpora are too large to be validated as one +document, and, furthermore, exist in two versions (the "plain text" +one, and the linguistically annotated one) there are four schemas for +validation: + +- ParlaMint-TEI, used to validate component (TEI rooted) files of a + ParlaMint corpus. This schema also contains most of the definitions + that are imported by the other schemas. + +- ParlaMint-teiCorpus, used to validate the top-level (teiCorpus + rooted) file of a ParlaMint corpus. The file should contain + XIncludes of the corpus components. + +- ParlaMint-TEI.ana, used to validate component (TEI rooted) files of + the linguistically annotated version of a ParlaMint corpus. + +- ParlaMint-teiCorpus.ana, used to validate the top-level (teiCorpus + rooted) file of the linguistically annotated version of a ParlaMint + corpus. 
+ +For validating the ParlaMint corpus of country XX using standard +ParlaMint names for directories and files, a validation run under Unix +using jing installed at /usr/share/java/ would be: + +$ java -jar /usr/share/java/jing.jar ParlaMint-teiCorpus.rng ParlaMint-XX/ParlaMint-XX.xml +$ java -jar /usr/share/java/jing.jar ParlaMint-TEI.rng ParlaMint-XX/ParlaMint-XX_*.xml +$ java -jar /usr/share/java/jing.jar ParlaMint-teiCorpus.ana.rng ParlaMint-XX.ana/ParlaMint-XX.ana.xml +$ java -jar /usr/share/java/jing.jar ParlaMint-TEI.ana.rng ParlaMint-XX.ana/ParlaMint-XX_*ana.xml diff --git a/Corpora/Docs/README-en.text.txt b/Corpora/Docs/README-en.text.txt new file mode 100644 index 000000000..50f6d9546 --- /dev/null +++ b/Corpora/Docs/README-en.text.txt @@ -0,0 +1,13 @@ + Comparable parliamentary corpus + ParlaMint-XX ZZ + Derived plain-text corpus with TSV metadata + + Citation, documentation, download, and licence available from + YY + +This directory contains the machine translated ParlaMint-XX corpus as plain text, +which was automatically converted from the TEI encoded version of the corpus. The text +files have two TAB-separated columns, the first one giving the ID of the speech, +and the second the plain text of the speech. Transcriber comments are given in +double square brackets. Additionally, each text file has an associated TSV file +giving the metadata for each speech. 
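The two-column layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the files are UTF-8 and have no header row (neither is stated above), and the helper names and the sample content are hypothetical.

```python
import io
import re

# Transcriber comments are wrapped in double square brackets, e.g. [[Applause]].
COMMENT = re.compile(r"\[\[(.*?)\]\]")

def read_speeches(lines):
    """Yield (speech_id, text) pairs from TAB-separated lines."""
    for line in lines:
        speech_id, tab, text = line.rstrip("\n").partition("\t")
        if not tab:
            continue  # skip blank lines or lines without a TAB
        yield speech_id, text

def split_comments(text):
    """Return the text without comments, plus the list of comment strings."""
    comments = COMMENT.findall(text)
    cleaned = re.sub(r"\s+", " ", COMMENT.sub(" ", text)).strip()
    return cleaned, comments

# Hypothetical sample standing in for a real text file opened with encoding="utf-8".
sample = "u1\tThank you. [[Applause]] Let us begin.\nu2\tNo comments here.\n"
speeches = list(read_speeches(io.StringIO(sample)))
cleaned, comments = split_comments(speeches[0][1])
```

Splitting on the first TAB only keeps any later TAB characters inside the speech text instead of silently dropping them.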
diff --git a/Corpora/Docs/README-en.vert.txt b/Corpora/Docs/README-en.vert.txt new file mode 100644 index 000000000..1a6c76440 --- /dev/null +++ b/Corpora/Docs/README-en.vert.txt @@ -0,0 +1,18 @@ + Comparable parliamentary corpus + ParlaMint-XX ZZ + Version with derived vertically encoded corpus + + Citation, documentation, download, and licence available from + YY + +This directory contains the so-called vertical files (the format used by the CQP +and (no)Sketch Engine concordancers), which were automatically converted from the +linguistically encoded TEI.ana version of the machine translated ParlaMint-XX corpus. +Note that the vertical files do not contain all the information from the source TEI, +and that they use different element and attribute names from the TEI source. + +Also included is the registry file, which is needed for noSketch Engine or +KonText (or, rather, their manatee back-end) to compile and mount the +corpus. The registry file has various values (such as paths to the data files) +that are specific to the noSketch Engine installation at CLARIN.SI, so they need +to be changed for any local installation of the corpus. diff --git a/Corpora/Docs/README.TEI.ana.txt b/Corpora/Docs/README.TEI.ana.txt new file mode 100644 index 000000000..28f46094e --- /dev/null +++ b/Corpora/Docs/README.TEI.ana.txt @@ -0,0 +1,12 @@ + Comparable parliamentary corpus + ParlaMint-XX.ana ZZ + TEI linguistically annotated version + + Citation, documentation, download, and licence available from + YY + +This directory contains the ParlaMint/TEI linguistically annotated version of +the ParlaMint-XX corpus. The root file ParlaMint-XX.ana.xml contains the +corpus teiHeader and XIncludes of the component files, which have the form +ParlaMint-XX_.ana.xml. Also included are the XML schemas of the +corpus. 
diff --git a/Corpora/Docs/README.TEI.txt b/Corpora/Docs/README.TEI.txt new file mode 100644 index 000000000..27afcee27 --- /dev/null +++ b/Corpora/Docs/README.TEI.txt @@ -0,0 +1,10 @@ + Comparable parliamentary corpus + ParlaMint-XX ZZ + TEI encoded version + + Citation, documentation, download, and licence available from + YY + +This directory contains the ParlaMint/TEI encoded version of the ParlaMint-XX +corpus. The root file ParlaMint-XX.xml contains the corpus teiHeader and +XIncludes of the component files, which have the form ParlaMint-XX_.xml. diff --git a/Corpora/Docs/README.conll.txt b/Corpora/Docs/README.conll.txt new file mode 100644 index 000000000..c9499d837 --- /dev/null +++ b/Corpora/Docs/README.conll.txt @@ -0,0 +1,15 @@ + Comparable parliamentary corpus + ParlaMint-XX.ana ZZ + Derived CoNLL-U encoded corpus with TSV metadata + + Citation, documentation, download, and licence available from + YY + +This directory contains the CoNLL-U encoded ParlaMint-XX.ana corpus, which +was automatically converted from its linguistically encoded TEI.ana version. The +corpus also contains NER annotation in the IOB format. Additionally, each +CoNLL-U file has two associated TSV files giving the metadata of its speeches, one using +names in the corpus language and the other in English. + +Note that the CoNLL-U files do not contain all the information from the source +corpus; in particular, the transcriber comments are not included. diff --git a/Corpora/Docs/README.md/README-AT.md b/Corpora/Docs/README.md/README-AT.md new file mode 100644 index 000000000..d0bb83baf --- /dev/null +++ b/Corpora/Docs/README.md/README-AT.md @@ -0,0 +1,30 @@ +# ParlaMint directory for samples of country AT (Austria) + +- Languages: de (German) + +## Documentation + +### Characteristics of the national parliament + +The Austrian Parliament is bicameral and consists of the following two chambers: the National Council (“Nationalrat”) and the Federal Council (“Bundesrat”). 
The political system is a multi-party system. The ParlaMint-AT corpus contains the shorthand records of the plenary sittings of the National Council from term 20 to term 27 (1996 - 2022). + +### Data source and acquisition + +The shorthand records are freely available on the website of the Austrian Parliament (https://www.parlament.gv.at/PAKT/STPROT/) as HTML or PDF documents from the 20th legislative period onward. For the earlier legislative periods only scanned originals in PDF are available. The HTML version of the shorthand records was scraped from the Austrian parliamentary website as the data source for the ParlaMint-AT corpus. Note that the original HTML documents are encoded in Windows-1252, not UTF-8. +Metadata about legislative periods, governments and persons was also retrieved in HTML format from https://www.parlament.gv.at/PAKT/STPROT/ and subsequently transformed into XML-TEI using dedicated Perl scripts. + +### Data encoding process + +The original HTML data was first cleaned of obvious formatting errors by applying string substitutions in Perl and then transformed to XHTML using tidy html 5. This data was then further transformed into TEI-XML using a series of Perl and XSLT scripts which were created previously for the ParlAT corpus. + +### Corpus-specific metadata + +There is no metadata available going beyond what’s common for all corpora. + +### Structure + +There are no additional TEI elements beyond what’s described in the ParlaMint schema. + +### Linguistic annotation + +There is no specific linguistic annotation going beyond what’s common for all corpora. 
diff --git a/Corpora/Docs/README.md/README-BA.md b/Corpora/Docs/README.md/README-BA.md new file mode 100644 index 000000000..93783cb55 --- /dev/null +++ b/Corpora/Docs/README.md/README-BA.md @@ -0,0 +1,41 @@ +# ParlaMint directory for samples of country BA (Bosnia and Herzegovina) + +- Languages: bs (Bosnian) + + +## Documentation + +### Characteristics of the national parliament + +The Parliamentary Assembly of Bosnia and Herzegovina is the legislative body of Bosnia and Herzegovina. It consists of two chambers: The House of Representatives (42 members) and The House of Peoples (15 members). The parliament is elected every four years. The corpus contains unauthorized (but officially published) transcripts of parliamentary sessions from both houses. It covers the period of 1998-2022. + +### Data source and acquisition + +Transcripts of parliamentary debates were collected from the official website of the Parliamentary Assembly of Bosnia and Herzegovina and cover the period from 1998 to 2022. Records were originally stored as machine-readable PDF files with a loose structure and fluid form over different terms (https://www.parlament.ba/session/Read?ConvernerId=2; https://www.parlament.ba/session/Read?ConvernerId=1). Each document was parsed and text-mined using regular expressions (RegEx) in order to construct a proto-dataset with a simple structure having just two entries: a speaker (most often first and last name) and a speech (a string of text capturing transcribed spoken word in Bosnian-Croatian-Serbian). It was then further populated with meta-information assigned to its parent file – House of Parliament, date, and session number. Finally, the names of MPs were linked with their party affiliation and biographic information collected from the official website of the parliament (https://www.parlament.ba/delegate/list; https://www.parlament.ba/representative/list). Missing entries were filled manually based on an extensive online search. 
As raw text exported from PDF files does not contain any formatting tags, additional information on agenda points had to be extracted using regular expressions and checked manually. Agenda points were then used for identification of moderators. This was done for all terms with several rounds of cleaning and parsing. The speeches from 1998-2018 were collected as a part of an ERC-funded project ELWar (https://zenodo.org/record/6521063). + +### Data encoding process + +The data were initially structured in four different parts: + +- a table with transcriptions and their utterance IDs, +- a table with metadata on specific utterance IDs, including the ID of the speaker, date, term, house, speaker role and party, +- a table linking speaker ID with their personal data (e.g., their date and place of birth, education, party) +- a table describing parties, their abbreviation, full names, chairs, their coalition composition in specific terms, coalition vs. opposition statuses, and more. + +The first two resources were used during the construction of the component TEI documents, while the last two were encoded in the root TEI. The data were checked for inconsistencies and imputed as best as possible using government sources (e.g. parlament.ba) and independent projects (e.g. javnarasprava.ba). + +The data were read and cleaned using the python pandas library, after which a component XML template had been prepared. Day-level grouped data were packaged into a TEI-compatible format using the xmltree Python library, and inserted into the template. The root TEI document was prepared in a similar way, with the goal of encoding members of the parliament and the parties present in the data. + +Finally, a regex + xmltree pipeline was run over the data to detect transcriber comments in the transcripts and to encode them in the TEI format as different types of notes, interruptions, gaps, or applause. 
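The comment-detection step described above could look roughly like the following sketch. It rests on several assumptions that the text does not confirm: that transcriber comments appear in parentheses, that English keywords stand in for the actual Bosnian ones, and that the standard-library `xml.etree.ElementTree` module is used in place of the xmltree library named above. The element and attribute names follow common TEI usage and should be checked against the ParlaMint schema.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: transcriber comments appear in parentheses inside the speech text.
COMMENT = re.compile(r"\(([^()]*)\)")

# Hypothetical keyword rules: keyword -> (element name, attributes).
RULES = [
    ("applause", ("kinesic", {"type": "applause"})),
    ("interrupt", ("vocal", {"type": "interruption"})),
    ("inaudible", ("gap", {"reason": "inaudible"})),
]

def classify(comment):
    """Pick an element for a comment by keyword; fall back to a plain note."""
    lowered = comment.lower()
    for keyword, target in RULES:
        if keyword in lowered:
            return target
    return ("note", {})

def encode_segment(text):
    """Build a <seg> element with comments turned into child elements."""
    seg = ET.Element("seg")
    last = None  # most recently added child; following text goes into its tail
    pos = 0
    for match in COMMENT.finditer(text):
        chunk = text[pos:match.start()]
        if last is None:
            seg.text = (seg.text or "") + chunk
        else:
            last.tail = (last.tail or "") + chunk
        tag, attrs = classify(match.group(1))
        last = ET.SubElement(seg, tag, attrs)
        if tag == "note":
            last.text = match.group(1)
        else:
            ET.SubElement(last, "desc").text = match.group(1)
        pos = match.end()
    rest = text[pos:]
    if last is None:
        seg.text = (seg.text or "") + rest
    else:
        last.tail = (last.tail or "") + rest
    return seg

seg = encode_segment("Thank you. (Applause) We continue (inaudible) with the agenda.")
xml_string = ET.tostring(seg, encoding="unicode")
```

Keeping the surrounding text in `text` and `tail` preserves the original word order, so the speech can be reconstructed by dropping the comment elements.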
+ +### Corpus-specific metadata + +There are no metadata available beyond what is common for all corpora. + +### Structure + +There are no additional TEI elements beyond what is described in the ParlaMint schema. + +### Linguistic annotation + +For annotating the Bosnian corpus, the standard language models for Croatian of the CLASSLA-Stanza pipeline (https://pypi.org/project/classla/) were used. At the level of morphosyntactic annotation, MULTEXT-East annotations (http://nl.ijs.si/ME/V6/msd/html/msd-hbs.html) are also made available for this corpus. \ No newline at end of file diff --git a/Corpora/Docs/README.md/README-BE.md b/Corpora/Docs/README.md/README-BE.md new file mode 100644 index 000000000..e70ccc83a --- /dev/null +++ b/Corpora/Docs/README.md/README-BE.md @@ -0,0 +1,62 @@ +# ParlaMint directory for samples of country BE (Belgium) + +- Languages: fr (French), nl (Dutch) + + +## Documentation + +### Characteristics of the national parliament + +The Belgian Federal Parliament is the bicameral parliament of Belgium. It consists of the Chamber of Representatives (https://www.dekamer.be/) and the Senate (https://www.senate.be). + +The current corpus consists of transcripts of the plenary sessions and the committee meetings of the Chamber of Representatives. + +The plenary assembly is the assembly of 150 directly elected representatives of the people. + +Its main tasks (https://www.dekamer.be/kvvcr/showpage.cfm?section=/pri/competence&language=nl&story=competence.xml) are to monitor government policy and public finance and to control legislation; together with the Senate, the Chamber is responsible for the Constitution and legislation concerning the organisation of the State. For all other legislation, the Chamber alone is competent. + +The committees prepare the work of the plenary, which allows it to work more efficiently and quickly. 
Draft laws and proposals (bills, motions for resolutions, proposals to set up a committee of enquiry, proposals to revise the Constitution) are presented, discussed, possibly amended and voted on. The report of the discussion and the text adopted by the committee are then submitted to the plenary. Besides preparing the legislative work, the committees also exercise control over the government through interpellations and oral questions. + +### Data source and acquisition + +The source data were obtained by scraping from the parliamentary website (https://www.dekamer.be/). They consist of HTML apparently exported from Microsoft Word. + +Further details can be found in the corpus headers and in the table below: + +| Period | 2015-2020 | +| :---- |:---- | +| Size | 356 plenary sessions, 1335 committee meetings, 148425 speeches, 32563557 tokens | +| Language | Mainly mixed French and Dutch (55% French, 45% Dutch, measured in annotated tokens). Several hundred German utterances. | +| Source format | HTML apparently exported from Microsoft Word | +| Data harvesting | Scraping from the parliamentary website (https://www.dekamer.be/) | +| Availability | Public domain; Available from CLARIN website as part and INT Language resource repository. +Handles: http://hdl.handle.net/11356/1388 for the unannotated corpus, http://hdl.handle.net/11356/1405 for the linguistically annotated corpus. | + + +### Data encoding process + +The conversion consists of several steps to transform and enrich the HTML source. + +- The first step was to transform the HTML to XML, omitting irrelevant HTML tags and keeping the meaningful elements. +- The second step consists of a set of regex-based search and replace actions on the XML to prepare the transformation to TEI with two XSLT stylesheets. +- In the last step we added language detection with a Python script, as we discovered that this module did a better job than the original MS Word language recognition in some cases. 
+ +The main challenges were related to the unstructured nature of the source data. We had to deal with many inconsistencies in the use of HTML elements, classes and styles. It was a challenging task to recognize the beginning and ending of the speeches and to separate them into monolingual segments. + +### Structure + +The dependency parser sometimes trips over long sentences (200 tokens or more, mostly enumerations). Such sentences are annotated as follows: +```XML + + Sentence could not be parsed: [sentence] + +``` + +### Linguistic annotation + +The linguistic processing involves Universal Dependencies PoS and dependency relations, lemma, and four-class (PER, LOC, ORG, MISC) named entity recognition. The process for the BE corpus consists of: + +- Language identification, consisting of a combination of the Microsoft Office language identification present in the source documents and the Python language identification module langdetect (https://pypi.org/project/langdetect/). +- Tokenization (Dutch and French) and Tagging/Lemmatizing (Dutch only) by means of an INT in-house tagger based on Support Vector Machines, which supports TEI input and output. +- Dependency parsing and NER, using the trankit (https://github.com/nlp-uoregon/trankit) Universal Dependencies pipeline. +- Post-processing to conform to the strict ParlaMint schema, to generate the corpus header from the metadata database and the component files, and to remove incorrectly identified named entities in the first position of sentences for French. diff --git a/Corpora/Docs/README.md/README-BG.md b/Corpora/Docs/README.md/README-BG.md new file mode 100644 index 000000000..ef54e2ccf --- /dev/null +++ b/Corpora/Docs/README.md/README-BG.md @@ -0,0 +1,43 @@ +# ParlaMint directory for samples of country BG (Bulgaria) + +- Language: bg (Bulgarian) + +## Documentation + +### Characteristics of the national parliament + +The Bulgarian Parliament is unicameral. The political system is a multi-party system. 
+ +The first phase of the corpus contains plenary meetings from 2014-10-27 to 2020-07-31 and includes 717 documents or 19,096,761 words. The new data cover the period from 2020-09-02 to 2022-07-29 and comprise 204 documents. + +The challenge with the new data was that three parliamentary elections were held in 2021, resulting in several short-lived parliaments. + +### Data source and acquisition + +The data was downloaded manually from the official page of the Bulgarian National Assembly, since the site does not allow the whole data set to be downloaded automatically. Thus, it took about 2 months to get the data for 5 years (2015-2020) and 1 month to get the data for the last two years (2020-2022). The minutes for each day are represented in a single HTML file which was easy to convert to XML. + +### Data encoding process + +The conversion was performed in an incremental way. Initially, the data was converted into basic TEI XML and uploaded into the CLaRK system. Then, the Parla-CLARIN DTD was used for validation. However, this turned out to be too permissive, so additional constraint schemata were applied. Within CLaRK the conversion was done with the help of constraints (as implemented rules) and regular grammars for joining some elements. The speaker and incident data was extracted, classified and returned to the texts with the appropriate features added. + +For the speakers (mainly MPs) we collected information from the website of the parliament. Then the data was converted into the XML person format defined by the Parla-CLARIN guide. For the speakers that were not part of the parliament at the time of corpus creation, we collected the data over the web (mainly from Wikipedia, websites of ministries, agencies and other institutions). For some of the guest speakers we ended up with very limited information. + +One problem that required some manual work was the connection between the record of the speaker names and the speaker records. 
The main problems were misspellings of names and ambiguities between names. + +For the TEI header component we prepared a parameterized version which was inserted into each meeting report document. Then the parameters were replaced with the actual data from the original XML document. + +After the validation of each document within the CLaRK system with respect to the Parla-CLARIN DTD, the documents were exported and validated with respect to the Parla-CLARIN Relax NG schema. The validation was performed with the help of Oxygen XML Editor. Some errors were found during this validation. + +In Phase 2 of the project, the earlier part of the corpus was improved with respect to the TEI format and metadata and errors were corrected, while the new data were compiled. This time GitHub was used extensively for validation. + +### Corpus-specific metadata + +In addition to the actual debates there are Excel tables representing the voting results during the day. The voting results are not represented in the current version of the corpus, but they are downloaded and incorporated into the initial XML document for further processing and incorporation within the corpus. + +### Structure + +The corpus strictly follows the TEI elements/attributes that were needed at this stage. + +### Linguistic annotation + +For annotating the Bulgarian corpus the CLASSLA-Stanza pipeline (https://pypi.org/project/classla/) was used. Thus, it follows the UD morpho-syntactic schema. The NER module includes the traditional NEs: Person, Location, Organization and Misc. We would like to thank Nikola Ljubešić for training and running the tools. 
diff --git a/Corpora/Docs/README.md/README-CZ.md b/Corpora/Docs/README.md/README-CZ.md new file mode 100644 index 000000000..0b587d750 --- /dev/null +++ b/Corpora/Docs/README.md/README-CZ.md @@ -0,0 +1,46 @@ +# ParlaMint directory for samples of country CZ (Czech Republic) + +- Language: cs (Czech) + +## Documentation + +### Characteristics of the national parliament + +The Parliament of the Czech Republic (PCR) consists of two chambers: the Lower House (Chamber of Deputies) and the Upper House (Senate). The Joint Czech and Slovak Digital Parliamentary Library contains recordings of the Assemblies from the earliest time of their existence (since the 10th century) until the very last sitting of PCR. Since the establishment of the first parliament of the new Czechoslovak Republic in 1918 the available documents are much more extensive. + +The Parliament works in periods (terms) between one general election and the next. Regular meetings are organized, and they typically span more than one day. Each meeting has its own agenda and an agenda item is discussed in speeches that can be made at more than one sitting. For every term, there is a “nest”-style site that publishes voting records, stenographic protocols, audio files, parliamentary prints, parliamentary documents, resolutions, decisions, interpellations, and biographical data about the members of PCR, boards, committees, delegations (e.g., see the site for the 9th term of the current Chamber of Deputies). + +The ParlaMint-CZ corpus contains the stenographic protocols of the Chamber of Deputies from the period 25th Nov 2013 - 18th Oct 2022. + +### Data source and acquisition + +We scraped the protocols from the Joint Czech and Slovak Digital Parliamentary Library where the protocols are available for each meeting. Metadata about persons and organizations was extracted from a database dump (https://psp.cz/sqw/hp.sqw?k=1300). 
Metadata about members of the government was scraped from the website of the Czech Republic government. + +### Data encoding process + +1. We scrape the data using a Perl script directly into the TEI format and we + - split the texts into agenda items discussed in one sitting + - keep the original page-breaks () and the url links in the source data + - decode dates and times listed in the comments embedded in the texts + - keep the links to the audio files and detect missing links to the audio files + - detect the transcription notes given in brackets +2. We download the bibliographic metadata from the website of the Government. A person can have multiple ids in the original data sources, so we unify them so that each person has only one unique id. The Government website lists dates of birth in persons’ CVs. That helps us to identify persons' records in the parliament database dump and thus the person ids are presented in the format ForenameSurname.birthyear. In addition we interlink the persons to various organizations (e.g., boards, committees, delegations). Compared to the previous version of the corpus, we have merged many organizations and converted the original organizations into events in the merged organizations. +3. We automatically categorize reporters’ notes using keywords and regexp search. +4. We linguistically annotate the texts and compute descriptive statistics from them (e.g., the number of words). We use the format of the ParCzech project, which is slightly richer than the ParlaMint format. We run an XSLT transformation to convert the data into the ParlaMint format. 
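The id unification in step 2 can be illustrated with a small Python sketch. Only the ForenameSurname.birthyear format comes from the description above; the record fields, the source id values, and the choice to match on name plus birth year alone are hypothetical simplifications, and the text does not say whether diacritics are kept in the real ids.

```python
from collections import defaultdict

def person_id(forename, surname, birth_year):
    """Build an id in the ForenameSurname.birthyear format."""
    name = (forename + surname).replace(" ", "")
    return f"{name}.{birth_year}"

def merge_source_ids(records):
    """Map each unified person id to the set of source ids it replaces."""
    merged = defaultdict(set)
    for record in records:
        unified = person_id(record["forename"], record["surname"], record["birth_year"])
        merged[unified].add(record["source_id"])
    return dict(merged)

# Hypothetical records from two data sources describing three entries, two of
# which refer to the same person.
records = [
    {"source_id": "psp:101", "forename": "Jana", "surname": "Nováková", "birth_year": 1970},
    {"source_id": "gov:7", "forename": "Jana", "surname": "Nováková", "birth_year": 1970},
    {"source_id": "psp:202", "forename": "Petr", "surname": "Novák", "birth_year": 1965},
]
merged = merge_source_ids(records)
```

Two different people sharing a name and birth year would collide under this scheme, so a real pipeline would need an additional disambiguation rule.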
+ +### Corpus-specific metadata + +- links to the source data (utterances, pages) +- links to the audio files that correspond to single source pages +- data on not only political parties, but other organizations as well +- records about members of parliament contain links to their personal websites, Facebook pages and official parliament photos + +### Structure + +We did not use any TEI structural elements/attributes going beyond what’s described in the ParlaMint schema. + +### Linguistic annotation + +- For the UD annotation we used UDPipe 2 with no specifics. +- For the NER annotation we used NameTag 2 (model czech-cnec2.0-200831) that classifies named entities according to a two-level hierarchy of 46 nested named-entity types and four complex container types (address, person name, bibliography citation, temporal expression). This rich taxonomy contains not only proper names but other entity types as well: and