Data UA 4.0.1 #830

Open
wants to merge 15 commits into
base: data
Large diffs are not rendered by default. Files affected:

56,037 changes: 56,037 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2002-05-21-m0.ana.xml
636 changes: 636 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2002-05-21-m0.xml
62,442 changes: 62,442 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2003-04-17-m0.ana.xml
1,135 changes: 1,135 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2003-04-17-m0.xml
62,716 changes: 62,716 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2003-11-05-m0.ana.xml
672 changes: 672 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2003-11-05-m0.xml
41,997 changes: 41,997 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2005-11-15-m1.ana.xml
691 changes: 691 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2005-11-15-m1.xml
51,304 changes: 51,304 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2006-07-25-m1.ana.xml
638 changes: 638 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2006-07-25-m1.xml
50,968 changes: 50,968 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2007-04-05-m1.ana.xml
832 changes: 832 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2007-04-05-m1.xml
50,184 changes: 50,184 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2008-04-01-m0.ana.xml
804 changes: 804 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2008-04-01-m0.xml
57,886 changes: 57,886 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2009-07-01-m0.ana.xml
722 changes: 722 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2009-07-01-m0.xml
48,004 changes: 48,004 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2010-04-27-m0.ana.xml
790 changes: 790 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2010-04-27-m0.xml
57,748 changes: 57,748 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2011-12-20-m1.ana.xml
869 changes: 869 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2011-12-20-m1.xml
50,019 changes: 50,019 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2012-04-27-m0.ana.xml
745 changes: 745 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2012-04-27-m0.xml
50,580 changes: 50,580 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2013-10-22-m1.ana.xml
1,032 changes: 1,032 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2013-10-22-m1.xml
54,048 changes: 54,048 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2014-03-25-m1.ana.xml
803 changes: 803 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2014-03-25-m1.xml
45,026 changes: 45,026 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2015-03-17-m0.ana.xml
598 changes: 598 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2015-03-17-m0.xml
54,818 changes: 54,818 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2016-06-16-m1.ana.xml
845 changes: 845 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2016-06-16-m1.xml
48,889 changes: 48,889 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2017-12-06-m1.ana.xml
478 changes: 478 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2017-12-06-m1.xml
66,252 changes: 66,252 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2018-07-12-m1.ana.xml
1,036 changes: 1,036 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2018-07-12-m1.xml
57,717 changes: 57,717 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2019-12-17-m0.ana.xml
838 changes: 838 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2019-12-17-m0.xml
51,507 changes: 51,507 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2020-04-24-m0.ana.xml
731 changes: 731 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2020-04-24-m0.xml
47,150 changes: 47,150 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2021-03-30-m0.ana.xml
633 changes: 633 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2021-03-30-m0.xml
44,482 changes: 44,482 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2022-10-18-m0.ana.xml
638 changes: 638 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2022-10-18-m0.xml
44,872 changes: 44,872 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2023-07-26-m0.ana.xml
1,234 changes: 1,234 additions & 0 deletions Samples/ParlaMint-UA/ParlaMint-UA_2023-07-26-m0.xml

45 changes: 33 additions & 12 deletions Samples/ParlaMint-UA/ParlaMint-taxonomy-NER.ana.xml
@@ -1,22 +1,43 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-taxonomy-NER.ana" xml:lang="mul">
-<desc xml:lang="en"><term>Named entities</term></desc>
-<desc xml:lang="uk"><term>Іменовані сутності</term></desc>
+<taxonomy xmlns="http://www.tei-c.org/ns/1.0"
+xml:id="ParlaMint-taxonomy-NER.ana"
+xml:lang="mul">
+<desc xml:lang="en">
+<term>Named entities</term>
+</desc>
+<desc xml:lang="uk">
+<term>Іменовані сутності</term>
+</desc>
 <category xml:id="PER">
-<catDesc xml:lang="en"><term>person</term></catDesc>
-<catDesc xml:lang="uk"><term>власна назва людини</term></catDesc>
+<catDesc xml:lang="en">
+<term>person</term>
+</catDesc>
+<catDesc xml:lang="uk">
+<term>власна назва людини</term>
+</catDesc>
 </category>
 <category xml:id="LOC">
-<catDesc xml:lang="en"><term>location</term></catDesc>
-<catDesc xml:lang="uk"><term>географічна назва</term></catDesc>
+<catDesc xml:lang="en">
+<term>location</term>
+</catDesc>
+<catDesc xml:lang="uk">
+<term>географічна назва</term>
+</catDesc>
 </category>
 <category xml:id="ORG">
-<catDesc xml:lang="en"><term>organization</term></catDesc>
-<catDesc xml:lang="uk"><term>назва організації</term></catDesc>
+<catDesc xml:lang="en">
+<term>organization</term>
+</catDesc>
+<catDesc xml:lang="uk">
+<term>назва організації</term>
+</catDesc>
 </category>
 <category xml:id="MISC">
-<catDesc xml:lang="en"><term>miscellaneous</term></catDesc>
-<catDesc xml:lang="uk"><term>різне</term></catDesc>
+<catDesc xml:lang="en">
+<term>miscellaneous</term>
+</catDesc>
+<catDesc xml:lang="uk">
+<term>різне</term>
+</catDesc>
 </category>
 </taxonomy>

273 changes: 203 additions & 70 deletions Samples/ParlaMint-UA/ParlaMint-taxonomy-parla.legislature.xml

Large diffs are not rendered by default.

31 changes: 21 additions & 10 deletions Samples/ParlaMint-UA/ParlaMint-taxonomy-speaker_types.xml
@@ -1,18 +1,29 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-taxonomy-speaker_types" xml:lang="mul">
-<desc xml:lang="en"><term>Types of speakers</term></desc>
-<desc xml:lang="uk"><term>Типи промовців</term></desc>
+<taxonomy xmlns="http://www.tei-c.org/ns/1.0"
+xml:id="ParlaMint-taxonomy-speaker_types"
+xml:lang="mul">
+<desc xml:lang="en">
+<term>Types of speakers</term>
+</desc>
+<desc xml:lang="uk">
+<term>Типи промовців</term>
+</desc>
 <category xml:id="chair">
-<catDesc xml:lang="en"><term>Chairperson</term>: chairman of a sitting</catDesc>
-<catDesc xml:lang="uk"><term>головуючий</term>: головуючий на засіданні</catDesc>
+<catDesc xml:lang="en">
+<term>Chairperson</term>: chairman of a sitting</catDesc>
+<catDesc xml:lang="uk">
+<term>головуючий</term>: головуючий на засіданні</catDesc>
 </category>
 <category xml:id="regular">
-<catDesc xml:lang="en"><term>Regular</term>: a regular speaker at a sitting</catDesc>
-<catDesc xml:lang="uk"><term>регулярний</term>: народний депутат або представник уряду, який бере участь у засіданні</catDesc>
+<catDesc xml:lang="en">
+<term>Regular</term>: a regular speaker at a sitting</catDesc>
+<catDesc xml:lang="uk">
+<term>регулярний</term>: народний депутат або представник уряду, який бере участь у засіданні</catDesc>
 </category>
 <category xml:id="guest">
-<catDesc xml:lang="en"><term>Guest</term>: a guest speaker at a sitting</catDesc>
-<catDesc xml:lang="uk"><term>гість</term>: промовець на засіданні, який не є народним депутатом або представником уряду</catDesc>
+<catDesc xml:lang="en">
+<term>Guest</term>: a guest speaker at a sitting</catDesc>
+<catDesc xml:lang="uk">
+<term>гість</term>: промовець на засіданні, який не є народним депутатом або представником уряду</catDesc>
 </category>
 </taxonomy>

33 changes: 23 additions & 10 deletions Samples/ParlaMint-UA/ParlaMint-taxonomy-subcorpus.xml
@@ -1,18 +1,31 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-taxonomy-subcorpus" xml:lang="mul">
-<desc xml:lang="en"><term>Subcorpora</term></desc>
-<desc xml:lang="uk"><term>підкорпуси</term></desc>
+<taxonomy xmlns="http://www.tei-c.org/ns/1.0"
+xml:id="ParlaMint-taxonomy-subcorpus"
+xml:lang="mul">
+<desc xml:lang="en">
+<term>Subcorpora</term>
+</desc>
+<desc xml:lang="uk">
+<term>підкорпуси</term>
+</desc>
 <category xml:id="reference">
-<catDesc xml:lang="en"><term>Reference</term>: reference subcorpus, until 2020-01-30</catDesc>
-<catDesc xml:lang="uk"><term>референтний</term>: референтний підкорпус, період до 2020-01-30</catDesc>
+<catDesc xml:lang="en">
+<term>Reference</term>: reference subcorpus, until 2020-01-30</catDesc>
+<catDesc xml:lang="uk">
+<term>референтний</term>: референтний підкорпус, період до 2020-01-30</catDesc>
 </category>
 <category xml:id="covid">
-<catDesc xml:lang="en"><term>COVID</term>: COVID subcorpus, from 2020-01-31 onwards, when WHO made the formal declaration of PHEIC, i.e. the Public Health Emergency of International Concern for COVID-19</catDesc>
-<catDesc xml:lang="uk"><term>ковідний</term>: ковідний підкорпус, період після 2020-01-31</catDesc>
+<catDesc xml:lang="en">
+<term>COVID</term>: COVID subcorpus, from 2020-01-31 onwards, when WHO made the formal
+declaration of PHEIC, i.e. the Public Health Emergency of International Concern for COVID-19</catDesc>
+<catDesc xml:lang="uk">
+<term>ковідний</term>: ковідний підкорпус, період після 2020-01-31</catDesc>
 </category>
 <category xml:id="war">
-<catDesc xml:lang="en"><term>War</term>: War in Ukraine subcorpus, from 2022-02-24 onwards, i.e. from Russia's full-scale invasion of Ukraine</catDesc>
-<catDesc xml:lang="uk"><term>Війна</term>: Підкорпус охоплює період війни в Україні, починаючи з 24 лютого 2022 року, тобто з часу повномасштабного вторгнення Росії в Україну.</catDesc>
+<catDesc xml:lang="en">
+<term>War</term>: War in Ukraine subcorpus, from 2022-02-24 onwards, i.e. from Russia's
+full-scale invasion of Ukraine</catDesc>
+<catDesc xml:lang="uk">
+<term>Війна</term>: Підкорпус охоплює період війни в Україні, починаючи з 24 лютого 2022 року, тобто з часу повномасштабного вторгнення Росії в Україну.</catDesc>
 </category>
 </taxonomy>
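
The date boundaries in the subcorpus taxonomy can be illustrated with a minimal Python sketch that maps a sample file name to subcorpus category ids. This is not part of the pull request: the helper names `date_from_filename` and `subcorpora` are hypothetical, and the sketch assumes that a sitting belongs to every subcorpus whose date range covers it (so a sitting after 2022-02-24 is labelled both `covid` and `war`), which the taxonomy itself does not state.

```python
import re
from datetime import date
from typing import List

# Boundaries taken from the catDesc texts in ParlaMint-taxonomy-subcorpus.xml.
COVID_START = date(2020, 1, 31)  # reference subcorpus runs until 2020-01-30
WAR_START = date(2022, 2, 24)


def date_from_filename(name: str) -> date:
    """Extract the sitting date from a name like ParlaMint-UA_2023-07-26-m0.xml."""
    match = re.search(r"_(\d{4})-(\d{2})-(\d{2})-", name)
    if match is None:
        raise ValueError(f"no date found in file name: {name}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def subcorpora(sitting: date) -> List[str]:
    """Return taxonomy category ids for a sitting date.

    Assumption (not confirmed by the source): ranges may overlap,
    so 'covid' and 'war' can both apply to the same sitting.
    """
    if sitting < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if sitting >= WAR_START:
        labels.append("war")
    return labels
```

For example, `subcorpora(date_from_filename("ParlaMint-UA_2002-05-21-m0.xml"))` yields only the `reference` id under these assumptions.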

22 changes: 11 additions & 11 deletions Samples/ParlaMint-UA/README.md
@@ -13,16 +13,16 @@ Parliamentary meetings during one term are grouped into several sessions. Each f

Commonly there may be one or two parliamentary meetings per day (a morning and an evening sitting).

-Although the official working language of the Rada is Ukrainian, some speeches during parliamentary proceedings between 2012 and 2023 were held in languages other than Ukrainian. All the speeches delivered by foreign guests were recorded in their translation into Ukrainian in the source texts. However, utterances by Ukrainian MPs and government officials that were produced in Russian were recorded in Russian. Total utterances in Russian comprise about 2% in the source texts, with most of them occurring before mid-2019, when the Law on Protecting the Functioning of the Ukrainian Language as the State Language came into effect.
+Although the official working language of the Rada is Ukrainian, some speeches during the parliamentary proceedings on record were held in other languages. All the speeches delivered by foreign guests in languages other than Ukrainian were recorded in their translation into Ukrainian in the source texts. However, utterances by Ukrainian MPs and government officials that were produced in Russian were recorded in Russian. With language identification done at the sentence level in the ParlaMint-UA 4.1 corpus, tokens in Ukrainian comprise 94% and tokens in Russian comprise 6% of the source texts. Instances of using Russian in the Verkhovna Rada occurred mostly before mid-2019, when the Law on Protecting the Functioning of the Ukrainian Language as the State Language came into effect.

-The political system in Ukraine is multi-party, with 349 political parties on record at the country's Single Registry as of 1 January 2020. Contemporary political parties in Ukraine tend not to have clear-cut ideologies and centre around civilizational and geostrategic orientations, individual politicians or business interests. Also, renaming and rebranding political parties ahead of elections is not unusual. Parties that break the 5% electoral threshold form factions in the parliament. MPs elected on party lists may be either members of the respective parties or be nominated by those parties without membership. Parliamentary groups may consist of MPs who left a parliamentary faction, members of different political parties or independent politicians. An MP may be a member of only one parliamentary faction or group at a time.
+The political system in Ukraine is multi-party, with 349 political parties on record at the country's Single Registry as of 1 January 2020. Contemporary political parties in Ukraine tend not to have clear-cut ideologies and centre around civilizational and geostrategic orientations, individual politicians or business interests. Also, renaming and rebranding political parties ahead of elections is not unusual. Parties that break the 5% electoral threshold form factions in the parliament. MPs elected on party lists may be either members of the respective parties or be nominated by those parties without membership. Parliamentary groups may consist of MPs who left a parliamentary faction, members of different political parties or independent politicians. An MP may be a member of only one parliamentary faction or group at a time. However, crossing the floor, i.e. formally changing one's political affiliation to a parliamentary faction or group different from the one an MP initially joined, is not exceptional in the Rada.


### Data source and acquisition

-The ParlaMint-UA corpus contains proceedings for the 7th, 8th and 9th terms of the Rada between 12 December 2012 and 24 February 2023. Archived records of all plenary sittings are available through the open data portal at the Rada site in HTM format (https://data.rada.gov.ua/open/data/plenary/page5/sp?int) under the CC BY 4.0 licence.
+The ParlaMint-UA 4.1 corpus contains proceedings for the 4th, 5th, 6th, 7th, 8th and 9th terms of the Rada between 14 May 2002 and 10 November 2023. Archived records of all plenary sittings are available through the open data portal at the Rada site in HTM format (https://data.rada.gov.ua/open/data/plenary/page5/sp?int) under the CC BY 4.0 licence.

-The metadata related to MPs were in part retrieved from the Rada website and in part gathered manually from official sources including the Central Election Commission of Ukraine, the official periodical of the Rada and other open data sources. Metadata related to Cabinet members and guest speakers were gathered manually from the current sites of the Cabinet of Ministers of Ukraine and the Rada, archived copies of webpages from the sites of the Rada, the Cabinet of Ministers of Ukraine, and the President of Ukraine as well as various open data sources including NGOs’ websites, mass and social media, and Wikipedia.
+The metadata related to MPs were in part retrieved from the Rada website and in part gathered manually from official sources, including the Central Election Commission of Ukraine and Holos Ukrainy, the official periodical of the Rada, as well as from other open data sources. Metadata related to Cabinet members and guest speakers were gathered manually from the current sites of the Cabinet of Ministers of Ukraine and the Rada, archived copies of webpages from the sites of the Rada, the Cabinet of Ministers of Ukraine, and the President of Ukraine as well as various open data sources including NGOs’ websites, mass and social media, and Wikipedia.

Since Chapel Hill expert surveys do not include Ukraine, the metadata on political orientation of the Ukrainian parties was obtained from Wikipedia, if available, and other sources including party webpages as well as analytical reports and publications by Ukrainian think tanks and research institutes.

@@ -31,24 +31,24 @@ Since Chapel Hill expert surveys do not include Ukraine, the metadata on politic

No correction of source texts was performed. Spaces were normalized. Sequences of dots were replaced with a single dot. Adjacent notes were joined. Opening and closing parentheses were moved into notes if missing. Regular apostrophes were replaced with soft apostrophes, which are used in the Ukrainian language. No end-of-line hyphens were present in the source. Quotation marks have been left in the text and are not explicitly marked up. The texts were segmented into utterances (speeches) and segments (corresponding to paragraphs in the source transcription).

-Language identification was based on expected frequencies for Ukrainian- and Russian-specific characters in the corpus (6.23 %(і) + 0.84 %(ї) + 0.39 %(є) + 0.01 %(ґ) = 7.47 % for Ukrainian, and 2.36 %(ы) + 0.36 %(э) + 0.2% (ё) + 0.02 %(ъ) = 2.94% for Russian), corpus-specific frequency word lists in Ukrainian and Russian, and Perl package Lingua::Identify::Any. A limitation of 250 characters was used for making decisions on language identification of shorter utterances based on Ukrainian-specific words, with a limitation of 100 characters for Russian-specific words.
+Language identification was done at the sentence level using the https://github.com/pemistahl/lingua-py library. The following language identification procedure was used:
+1) paragraphs were segmented into sentences with UDPipe 1 and the ukrainian-iu-ud-2.5-191206.udpipe model (language distinction was irrelevant at this stage, as it was assumed that overall sentence segmentation was the same in Ukrainian and Russian);
+2) the language of sentences was identified;
+3) adjacent sentences in the same language within the same paragraph were merged into spans;
+4) UDPipe annotation of each span was done with the respective model, ukrainian-iu-ud-2.12-230717 or russian-syntagrus-ud-2.12-230717;
+5) the paragraph language was set to the dominant token language (Ukrainian in case of a tie).


### Corpus-specific metadata

The extended affiliation role “acting” was used for government officials who were appointed to serve in the role of a minister or a deputy minister on an interim basis but not to hold a respective office. Patronymic names were included as a surname type. The category of regular speakers embraced not only MPs and members of the Cabinet of Ministers but also deputy ministers who may speak in the Rada on behalf of the ministries they represent.

-Also, metadata on all MPs from the 4th, 5th and 6th terms were stored, while they were available, with the intention to eventually include proceedings from the previous terms into the ParlaMint-UA corpus.

### Structure

There are no additional TEI structural elements beyond what is described in the ParlaMint schema.

### Linguistic annotation
-POS tagging, lemmatization and dependency parsing were done with UDPipe 2 (http://ufal.mff.cuni.cz/udpipe/2) with ukrainian-iu-ud-2.10-220711 and russian-syntagrus-ud-2.10-220711 models.
+POS tagging, lemmatization and dependency parsing were done with UDPipe 2 (http://ufal.mff.cuni.cz/udpipe/2) with ukrainian-iu-ud-2.12-230717 and russian-syntagrus-ud-2.12-230717 models.

The Ukrainian NER model was trained and deployed as part of the NameTag service (http://lindat.mff.cuni.cz/services/nametag/), with https://github.com/lang-uk/ner-uk dataset (data folder) used for training. We would like to thank [Jana Strakova](https://ufal.mff.cuni.cz/jana-strakova) for training the Ukrainian NER tool.

-### Disclaimer to the English translation
-
-Note that the automatically produced translation to English contains errors typical of neural machine translation, which also includes factual errors even when a high level of fluency is achieved, and any manual or automatic usage of this corpus should take the machine translation limitations into account.
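
Steps 3 and 5 of the sentence-level language identification procedure in the README diff above (merging adjacent same-language sentences into spans, then choosing the paragraph language by token majority) can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used for the corpus: sentences are assumed to arrive already labelled (for example by a detector such as lingua-py), the function names `merge_spans` and `paragraph_language` and the labels `"uk"` and `"ru"` are hypothetical, and tokens are approximated by whitespace splitting rather than UDPipe tokenization.

```python
from collections import Counter
from itertools import groupby
from typing import List, Tuple

UK, RU = "uk", "ru"  # hypothetical language labels

Sentence = Tuple[str, str]    # (text, language label)
Span = Tuple[str, List[str]]  # (language label, sentence texts)


def merge_spans(sentences: List[Sentence]) -> List[Span]:
    """Step 3: merge adjacent sentences with the same language into spans."""
    return [
        (lang, [text for text, _ in group])
        for lang, group in groupby(sentences, key=lambda sentence: sentence[1])
    ]


def paragraph_language(spans: List[Span], default: str = UK) -> str:
    """Step 5: choose the language with the most tokens; ties go to Ukrainian."""
    counts: Counter = Counter()
    for lang, texts in spans:
        # Whitespace tokens are a simplification of real tokenization.
        counts[lang] += sum(len(text.split()) for text in texts)
    if not counts:
        return default
    best = max(counts.values())
    winners = [lang for lang, count in counts.items() if count == best]
    return winners[0] if len(winners) == 1 else default
```

Each span produced by `merge_spans` would then be passed to the model for its language (step 4), which is why adjacent sentences are grouped before annotation rather than after.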