Commit

Merge pull request #856 from clarin-eric/data
Data-main
matyaskopp authored May 22, 2024
2 parents f2918ca + ccde51f commit 6eee33b
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion Samples/ParlaMint-CZ/README.md
@@ -11,7 +11,7 @@ The Parliament of the Czech Republic (PCR) consists of two chambers: the Lower H

The Parliament works in periods (terms) between one general election and the next. Regular meetings are organized, and they typically span more than one day. Each meeting has its own agenda, and an agenda item is discussed in speeches that can be made at more than one sitting. For every term, there is a “nest”-style site that publishes voting records, stenographic protocols, audio files, parliamentary prints, parliamentary documents, resolutions, decisions, interpellations, and biographical data about the members of the PCR, boards, committees, and delegations (e.g., see the site for the 9th term of the current Chamber of Deputies).

-The ParlaMint-CZ corpus contains the stenographic protocols of the Chamber of Deputies from the period 25th Nov 2013 - 18th Oct 2022.
+The ParlaMint-CZ corpus contains the stenographic protocols of the Chamber of Deputies from the period 25th Nov 2013 - 26th Jul 2023.

### Data source and acquisition

@@ -45,3 +45,8 @@ We did not use any TEI structural elements/attributes going beyond what’s desc
- For the UD annotation we used UDPipe 2 with no specifics.
- For the NER annotation we used NameTag 2 (model czech-cnec2.0-200831), which classifies named entities according to a two-level hierarchy of 46 nested named-entity types and four complex container types (address, person name, bibliography citation, temporal expression). This rich taxonomy covers not only proper names but other entity types as well: <date> and <time> for time expressions, <unit> for units, <num> for different types of numbers, <ref> for hypertext links, and <email> for email addresses.
- We merged the NameTag categories into the four categories used in ParlaMint (PER/ORG/LOC/MISC). In the ParlaMint-CZ corpus, both sets of categories are available: the ParlaMint categories are used for proper names and are stored in the type attribute, while the NameTag categories are stored in the ana attribute.

+### Disclaimer to the English translation
+
+Note that the automatically produced translation to English contains errors typical of neural machine translation, which also includes factual errors even when a high level of fluency is achieved, and any manual or automatic usage of this corpus should take the machine translation limitations into account.
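
The NER bullets in this hunk say that each named entity carries two labels: the ParlaMint category in the type attribute and the NameTag category in the ana attribute. As a minimal sketch of how a consumer of the corpus might read both, the snippet below parses a hand-written TEI fragment with Python's standard library. The `name` element, the `ne:` prefix, the attribute values and the sample sentence are assumptions made for illustration; they are not taken from the README or from the corpus files.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: the <name> element, the "ne:" prefix and the
# attribute values are illustrative assumptions, not copied from the corpus.
SAMPLE = f"""
<s xmlns="{TEI_NS}">
  <name type="PER" ana="ne:P"><w>Jan</w> <w>Novák</w></name>
  <w>navštívil</w>
  <name type="LOC" ana="ne:gu"><w>Brno</w></name>
</s>
""".strip()


def named_entities(xml_text):
    """Return (text, ParlaMint category, NameTag category) for each <name>."""
    root = ET.fromstring(xml_text)
    entities = []
    for name in root.iter(f"{{{TEI_NS}}}name"):
        # Join the word tokens inside the entity into one surface string.
        words = [w.text for w in name.iter(f"{{{TEI_NS}}}w")]
        entities.append((" ".join(words), name.get("type"), name.get("ana")))
    return entities


if __name__ == "__main__":
    for text, parlamint_cat, nametag_cat in named_entities(SAMPLE):
        print(text, parlamint_cat, nametag_cat)
```

Keeping both labels side by side lets a user filter on the coarse PER/ORG/LOC/MISC categories while still having the finer NameTag type available when needed.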
