Commit

Merge pull request #875 from clarin-eric/devel
Devel
matyaskopp authored Jun 3, 2024
2 parents c638251 + 2ba11b4 commit 9df7751
Showing 39 changed files with 669 additions and 25,385 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -9,19 +9,19 @@ which are a customisation of the [TEI Guidelines](https://tei-c.org/guidelines/p
should process the common data in any ParlaMint corpus, despite the differing parliamentary
systems of the countries, the kind of information included in the corpora, and, of course, language.

The latest version of ParlaMint is [4.0-en](https://github.com/clarin-eric/ParlaMint/releases/tag/v4.0-en)
The latest version of ParlaMint is [4.1](https://github.com/clarin-eric/ParlaMint/releases/tag/v4.1)
which contains corpora for 29 countries and autonomous regions in original languages as well as machine
translated to English, and is available from the CLARIN.SI repository:

- [ParlaMint-en.ana v4.0](http://hdl.handle.net/11356/1864): linguistically annotated machine translated ParlaMint corpora
- [ParlaMint.ana v4.0](http://hdl.handle.net/11356/1860): linguistically annotated variant of the ParlaMint corpora in original languages
- [ParlaMint v4.0](http://hdl.handle.net/11356/1859): "plain text", i.e. linguistically unannotated variant of the ParlaMint corpora in original languages
- [ParlaMint v4.1](http://hdl.handle.net/11356/1912): "plain text", i.e. linguistically unannotated variant of the ParlaMint corpora
- [ParlaMint.ana v4.1](http://hdl.handle.net/11356/1911): linguistically annotated variant of the ParlaMint corpora
- [ParlaMint-en.ana v4.1](http://hdl.handle.net/11356/1910): machine translated and linguistically annotated ParlaMint corpora

Publications connected to ParlaMint are available at the
[ParlaMint project page](https://www.clarin.eu/parlamint#publications-and%C2%A0presentations).

The two most comprehensive publication on ParlaMint corpora a preprint on version 4.1 (to be released soon)
and the LREV publication describing version 2.1:
The two most comprehensive publications on ParlaMint corpora are the LREV preprint describing version 4.1
and the LREV publication describing version 2.1:

- Tomaž Erjavec, Matyáš Kopp, Nikola Ljubešić, Taja Kuzman, Paul Rayson, Petya Osenova, Maciej
Ogrodniczuk, Çağrı Çöltekin, Danijel Koržinek, Katja Meden, Jure Skubic, Peter Rupnik, Tommaso
24 changes: 12 additions & 12 deletions Samples/ParlaMint-UA/README.md
@@ -13,16 +13,16 @@ Parliamentary meetings during one term are grouped into several sessions. Each f

Commonly there may be one or two parliamentary meetings per day (a morning and an evening sitting).

Although the official working language of the Rada is Ukrainian, some speeches during parliamentary proceedings between 2012 and 2023 were held in languages other than Ukrainian. All the speeches delivered by foreign guests were recorded in their translation into Ukrainian in the source texts. However, utterances by Ukrainian MPs and government officials that were produced in Russian were recorded in Russian. Total utterances in Russian comprise about 2% in the source texts, with most of them occurring before mid-2019, when the Law on Protecting the Functioning of the Ukrainian Language as the State Language came into effect.
Although the official working language of the Rada is Ukrainian, some speeches during the parliamentary proceedings on record were held in other languages. All the speeches delivered by foreign guests in languages other than Ukrainian were recorded in their translation into Ukrainian in the source texts. However, utterances by Ukrainian MPs and government officials that were produced in Russian were recorded in Russian. With language identification done at the sentence level in the ParlaMint-UA 4.1 corpus, tokens in Ukrainian comprise 94% and tokens in Russian comprise 6% in the source texts. Instances of using Russian in the Verkhovna Rada occurred mostly before mid-2019, when the Law on Protecting the Functioning of the Ukrainian Language as the State Language came into effect.

The political system in Ukraine is multi-party, with 349 political parties on record at the country's Single Registry as of 1 January 2020. Contemporary political parties in Ukraine tend not to have clear-cut ideologies and centre around civilizational and geostrategic orientations, individual politicians or business interests. Also, renaming and rebranding political parties ahead of elections is not unusual. Parties that break the 5% electoral threshold form factions in the parliament. MPs elected on party lists may be either members of the respective parties or be nominated by those parties without membership. Parliamentary groups may consist of MPs who left a parliamentary faction, members of different political parties or independent politicians. An MP may be a member of only one parliamentary faction or group at a time.
The political system in Ukraine is multi-party, with 349 political parties on record at the country's Single Registry as of 1 January 2020. Contemporary political parties in Ukraine tend not to have clear-cut ideologies and centre around civilizational and geostrategic orientations, individual politicians or business interests. Also, renaming and rebranding political parties ahead of elections is not unusual. Parties that break the 5% electoral threshold form factions in the parliament. MPs elected on party lists may be either members of the respective parties or be nominated by those parties without membership. Parliamentary groups may consist of MPs who left a parliamentary faction, members of different political parties or independent politicians. An MP may be a member of only one parliamentary faction or group at a time. However, crossing the floor, i.e. formally changing one's political affiliation to a parliamentary faction or group different from the one an MP initially joined, is not exceptional in the Rada.


### Data source and acquisition

The ParlaMint-UA corpus contains proceedings for the 7th, 8th and 9th terms of the Rada between 12 December 2012 and 24 February 2023. Archived records of all plenary sittings are available through the open data portal at the Rada site in HTM format (https://data.rada.gov.ua/open/data/plenary/page5/sp?int) under the CC BY 4.0 licence.
The ParlaMint-UA 4.1 corpus contains proceedings for the 4th, 5th, 6th, 7th, 8th and 9th terms of the Rada between 14 May 2002 and 10 November 2023. Archived records of all plenary sittings are available through the open data portal at the Rada site in HTM format (https://data.rada.gov.ua/open/data/plenary/page5/sp?int) under the CC BY 4.0 licence.

The metadata related to MPs were in part retrieved from the Rada website and in part gathered manually from official sources including the Central Election Commission of Ukraine, the official periodical of the Rada and other open data sources. Metadata related to Cabinet members and guest speakers were gathered manually from the current sites of the Cabinet of Ministers of Ukraine and the Rada, archived copies of webpages from the sites of the Rada, the Cabinet of Ministers of Ukraine, and the President of Ukraine as well as various open data sources including NGOs’ websites, mass and social media, and Wikipedia.
The metadata related to MPs were in part retrieved from the Rada website and in part gathered manually from official sources including the Central Election Commission of Ukraine, and Holos Ukrainy, which is the official periodical of the Rada, as well as from other open data sources. Metadata related to Cabinet members and guest speakers were gathered manually from the current sites of the Cabinet of Ministers of Ukraine and the Rada, archived copies of webpages from the sites of the Rada, the Cabinet of Ministers of Ukraine, and the President of Ukraine as well as various open data sources including NGOs’ websites, mass and social media, and Wikipedia.

Since Chapel Hill expert surveys do not include Ukraine, the metadata on political orientation of the Ukrainian parties was obtained from Wikipedia, if available, and other sources including party webpages as well as analytical reports and publications by Ukrainian think tanks and research institutes.

@@ -31,24 +31,24 @@ Since Chapel Hill expert surveys do not include Ukraine, the metadata on politic

No correction of source texts was performed. Spaces were normalized. Sequences of dots were replaced with a single dot. Adjacent notes were joined. Opening and closing parentheses were moved into notes if missing. Regular apostrophes were replaced with soft apostrophes, which are used in the Ukrainian language. No end-of-line hyphens were present in the source. Quotation marks have been left in the text and are not explicitly marked up. The texts were segmented into utterances (speeches) and segments (corresponding to paragraphs in the source transcription).

Language identification was based on expected frequencies for Ukrainian- and Russian-specific characters in the corpus (6.23 %(і) + 0.84 %(ї) + 0.39 %(є) + 0.01 %(ґ) = 7.47 % for Ukrainian, and 2.36 %(ы) + 0.36 %(э) + 0.2% (ё) + 0.02 %(ъ) = 2.94% for Russian), corpus-specific frequency word lists in Ukrainian and Russian, and Perl package Lingua::Identify::Any. A limitation of 250 characters was used for making decisions on language identification of shorter utterances based on Ukrainian-specific words, with a limitation of 100 characters for Russian-specific words.
Language identification was done at the sentence level using the https://github.com/pemistahl/lingua-py library. The following language identification procedure was used:
1) paragraphs were segmented into sentences with UDPipe 1 and the ukrainian-iu-ud-2.5-191206.udpipe model (language distinction was irrelevant at this stage, as it was assumed that overall sentence segmentation was the same in Ukrainian and Russian);
2) the language of sentences was identified;
3) adjacent sentences in the same language within the same paragraph were merged into spans;
4) UDPipe annotation was done on each span with the model for its language, ukrainian-iu-ud-2.12-230717 or russian-syntagrus-ud-2.12-230717;
5) paragraph language was set based on dominating token language (if equal, then Ukrainian).
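
Steps 3 and 5 of the procedure above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names `merge_spans` and `paragraph_language`, the language codes `"uk"` and `"ru"`, and the use of whitespace splitting as a stand-in for UDPipe token counts are all assumptions. The per-sentence language labels are taken as given here; in the real pipeline they come from the language identification in step 2.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical representation: one (language_code, sentence_text) pair per sentence.
Labelled = Tuple[str, str]


def merge_spans(sentences: List[Labelled]) -> List[Labelled]:
    """Step 3 (sketch): merge adjacent sentences that share a language into spans."""
    spans = []
    for lang, group in groupby(sentences, key=lambda pair: pair[0]):
        spans.append((lang, " ".join(text for _, text in group)))
    return spans


def paragraph_language(spans: List[Labelled], default: str = "uk") -> str:
    """Step 5 (sketch): pick the language with the most tokens; ties go to `default`.

    Tokens are approximated by whitespace splitting, which is an assumption;
    the real pipeline would count tokens produced by the tokenizer.
    """
    counts = {}
    for lang, text in spans:
        counts[lang] = counts.get(lang, 0) + len(text.split())
    if not counts:
        return default
    best = max(counts.values())
    winners = [lang for lang, n in counts.items() if n == best]
    # "if equal, then Ukrainian": the default wins whenever it is among the leaders.
    return default if default in winners else winners[0]


paragraph = [
    ("uk", "Шановні колеги!"),
    ("uk", "Прошу голосувати."),
    ("ru", "Спасибо."),
    ("uk", "Дякую."),
]
spans = merge_spans(paragraph)
print(spans)
print(paragraph_language(spans))
```

Each resulting span would then be passed to the annotation model for its language (step 4), while the paragraph as a whole receives the single label returned by `paragraph_language`.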


### Corpus-specific metadata

The extended affiliation role “acting” was used for government officials who were appointed to serve in the role of a minister or a deputy minister on an interim basis but not to hold a respective office. Patronymic names were included as a surname type. The category of regular speakers embraced not only MPs and members of the Cabinet of Ministers but also deputy ministers who may speak in the Rada on behalf of the ministries they represent.

Also, metadata on all MPs from the 4th, 5th and 6th terms were stored, while they were available, with the intention to eventually include proceedings from the previous terms into the ParlaMint-UA corpus.

### Structure

There are no additional TEI structural elements beyond what is described in the ParlaMint schema.

### Linguistic annotation
POS tagging, lemmatization and dependency parsing were done with UDPipe 2 (http://ufal.mff.cuni.cz/udpipe/2) with ukrainian-iu-ud-2.10-220711 and russian-syntagrus-ud-2.10-220711 models.
POS tagging, lemmatization and dependency parsing were done with UDPipe 2 (http://ufal.mff.cuni.cz/udpipe/2) with ukrainian-iu-ud-2.12-230717 and russian-syntagrus-ud-2.12-230717 models.

The Ukrainian NER model was trained and deployed as part of the NameTag service (http://lindat.mff.cuni.cz/services/nametag/), with https://github.com/lang-uk/ner-uk dataset (data folder) used for training. We would like to thank [Jana Strakova](https://ufal.mff.cuni.cz/jana-strakova) for training the Ukrainian NER tool.

### Disclaimer to the English translation

Note that the automatically produced translation to English contains errors typical of neural machine translation, which also includes factual errors even when a high level of fluency is achieved, and any manual or automatic usage of this corpus should take the machine translation limitations into account.
The Ukrainian NER model was trained and deployed as part of the NameTag service (http://lindat.mff.cuni.cz/services/nametag/), with https://github.com/lang-uk/ner-uk dataset (data folder) used for training. We would like to thank [Jana Strakova](https://ufal.mff.cuni.cz/jana-strakova) for training the Ukrainian NER tool.

This file was deleted.

5 changes: 0 additions & 5 deletions Samples/ParlaMint-XX/2011/ParlaMint-XX_2011-05-05-0-meta.tsv

This file was deleted.
