RO Feedback #626

Open
12 of 14 tasks
matyaskopp opened this issue Mar 24, 2023 · 32 comments · May be fixed by #625

Comments

@matyaskopp
Collaborator

matyaskopp commented Mar 24, 2023

meeting element

  • extend meeting elements (#parla.term, #parla.sitting)

I haven't found any term or sitting information in the meeting elements. This is how other corpora encode it:

<meeting ana="#parla.term #parla.uni" n="8" corresp="#ВРУ">8</meeting>
<meeting ana="#parla.session #parla.uni" n="1" corresp="#ВРУ">1</meeting>
<meeting ana="#parla.sitting #parla.uni" n="2014-12-02" corresp="#ВРУ">2014-12-02</meeting>

I was not able to find term information on the Romanian parliament websites, though I believe it is there.
If a single file contains exactly one sitting, please also add the sitting identification.

Missing speech content

  • speech content

In some files there is no speech content:
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-09-04-id4959.xml

        <note type="time">Şedinţa a început la ora 15,55.</note>
        <note type="chairman">Lucrările au fost conduse de domnul Ion Diaconescu, preşedintele Camerei Deputaţilor, asistat de domnii Andrei Ioan Chiliman şi Acsinte Gaspar, secretari.</note>
        <note type="speaker">Domnul Ion Diaconescu:</note>
        <u ana="#chair" who="#Ion-Diaconescu" xml:id="ParlaMint-RO_2000-09-04-id4959.u1"/>
        <note type="speaker">Domnul Iuliu Ioan Furo:</note>

but the source contains speech contents:
https://www.cdep.ro/pls/steno/steno2015.stenograma?ids=4959&idl=1#S0

Chairman note type

        <note type="chairman">Lucrările au fost conduse de domnul Ion Diaconescu, preşedintele Camerei Deputaţilor, asistat de domnii Andrei Ioan Chiliman şi Acsinte Gaspar, secretari.</note>

Unrecognized notes

  • notes in text

Notes are set in italics in the source, so they are easy to recognize.

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L474

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru?(Vociferără în partea dreaptă a sălii).Vă rog să număraţi... Vă rog să ridicaţi mâna, cei care sunteţi pentru acest amendament, să repetăm numărătoarea. Este o confuzie.</seg>


should be: (https://clarin-eric.github.io/ParlaMint/#TEI.vocal)

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru? <vocal type="shouting">
    <desc>(Vociferără în partea dreaptă a sălii)</desc>
  </vocal> Vă rog să număraţi... Vă rog să ridicaţi mâna, cei care sunteţi pentru acest amendament, să repetăm numărătoarea. Este o confuzie.</seg>

presence list

  • presence list is missing status

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L510-L513

        <u ana="#regular" who="#Andrei-Ioan-Chiliman" xml:id="ParlaMint-RO_2000-04-14-id4927.u46">
          <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u46.seg1">Achimescu Victor Ştefan</seg>
          <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u46.seg2">Aferăriţei Constantin</seg>
          <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u46.seg3">Afrăsinei Viorica</seg>


corpus timespan

  • corpus timespan bibl
  • corpus timespan setting
  • corpus timespan it would be nice to have it in text content of corpus title too

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L72

        <bibl>
          <title type="main" xml:lang="en">Meeting minutes of the Romanian Parliament</title>
          <title type="main" xml:lang="ro">Stenograme ale şedinţelor din Parlamentul României</title>
          <idno type="URI">http://www.parlament.ro/</idno>
          <date from="2000-02-01" to="2020-11-24">2000-02-01 - 2020-11-24</date>
        </bibl>

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L252

        <setting>
          <name type="city">Bucharest</name>
          <name type="place">Palace of the Parliament</name>
          <date from="2000-02-01" to="2020-11-24"/>
        </setting>

setting element

  • setting element in root file

The setting element in the root file should correspond to the ones in the component files (the country is missing):

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L249-L253

        <setting>
          <name type="city">Bucharest</name>
          <name type="place">Palace of the Parliament</name>
          <date from="2000-02-01" to="2020-11-24"/>
        </setting>

vs:
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L97-L101

        <setting>
          <name type="city">Bucharest</name>
          <name type="country" key="RO">Romania</name>
          <date when="2000-04-14" ana="#parla.sitting">14.04.2000</date>
        </setting>

capitalize surname

  • don't capitalize surname

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L384

              <surname>GORGHIU</surname>

should be

              <surname>Gorghiu</surname>
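A minimal sketch of the requested normalization, assuming surnames arrive as plain strings; the helper name `normalize_surname` is hypothetical and not taken from the corpus tooling. It only touches names that are entirely upper case, so values that are already mixed case pass through unchanged.

```python
def normalize_surname(surname: str) -> str:
    """Convert an all-caps surname to title case; leave other names untouched."""
    if surname.isupper():
        # str.title() capitalizes the first letter after every non-letter,
        # so each part of a hyphenated surname keeps its capital.
        return surname.title()
    return surname


print(normalize_surname("GORGHIU"))  # Gorghiu
```

Names with particles or unusual internal capitalization would need manual review, since `str.title()` cannot know about them.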

sort component files

  • sort component files

The component files should be ordered according to the contents' date.
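One way to get this ordering is to sort on the date embedded in each file name. The sketch below assumes the naming pattern seen in the links in this thread (for example `ParlaMint-RO_2000-09-04-id4959.xml`); the function names are hypothetical, and using the numeric id as a tie-breaker for same-day files is an assumption.

```python
import re

# Pattern assumed from the file names quoted in this thread,
# e.g. "ParlaMint-RO_2000-09-04-id4959.xml" (optionally ".ana.xml").
NAME_RE = re.compile(r"ParlaMint-RO_(\d{4}-\d{2}-\d{2})-id(\d+)(?:\.ana)?\.xml$")


def sort_key(filename):
    """Return (ISO date, numeric id) so that ISO dates sort chronologically."""
    match = NAME_RE.search(filename)
    if match is None:
        raise ValueError(f"unexpected component file name: {filename}")
    return match.group(1), int(match.group(2))


def sort_components(filenames):
    """Return the component file names ordered by date, then by id."""
    return sorted(filenames, key=sort_key)
```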

taxonomies

  • translations
  • wrong language context - English content in xml:lang="ro"
  • missing descriptions
@matyaskopp matyaskopp linked a pull request Mar 24, 2023 that will close this issue
@RePierre
Collaborator

Changed the capitalization of surnames with commit 51787f7.

@RePierre
Collaborator

Sorted component files in commit be08d9a.

@RePierre
Collaborator

Changed note type to narrative with commit 9fe5f43.

@RePierre
Collaborator

Converted notes into more specific elements within segments with commit cc386af.

@matyaskopp
Collaborator Author

matyaskopp commented Mar 28, 2023

Spaces around notes

  • spaces around notes inside text

Converted notes into more specific elements within segments with commit cc386af.

You have removed the spaces around notes, which can cause trouble in tokenization: a note can end up inside a token (unexpected behaviour in my annotation script).
https://github.com/romanian-parlamint/ParlaMint/blob/cc386afc90e1298cb4f4d79f44d5558949e4eeae/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L472

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru?<vocal type="shouting"><desc>(Vociferără în partea dreaptă a sălii).</desc></vocal>Vă <!-- ... --> confuzie.</seg>

Should be:

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru? <vocal type="shouting">
  <desc>(Vociferără în partea dreaptă a sălii).</desc>
</vocal> Vă <!-- ... --> confuzie.</seg>
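A check along these lines could catch glued notes before annotation. This is only a sketch using Python's standard `xml.etree.ElementTree`; the function name `glued_notes` is hypothetical, and the set of inline element names is taken from the validator messages quoted in this issue.

```python
import xml.etree.ElementTree as ET

# Inline elements that may appear inside <seg> next to running text.
INLINE_NOTES = {"vocal", "kinesic", "incident", "gap", "note"}


def glued_notes(seg):
    """Describe inline note elements that directly touch neighbouring text."""
    problems = []
    before = seg.text or ""          # text preceding the first child
    for child in seg:
        tag = child.tag.rsplit("}", 1)[-1]   # drop a namespace if present
        after = child.tail or ""     # text following this child
        if tag in INLINE_NOTES:
            if before and not before[-1].isspace():
                problems.append(f"<{tag}> glued to preceding text")
            if after and not after[0].isspace():
                problems.append(f"<{tag}> glued to following text")
        before = after
    return problems
```

An empty result means every inline note in the segment is separated from the surrounding text by whitespace.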

@RePierre
Collaborator

Added spaces around notes with commit 79b08b1.

@RePierre
Collaborator

wrong language context - English content in xml:lang="ro"

Can you please provide an example?

I ran find -type f -name *.xml -exec grep --color=auto -i -nH --null -e lang\=\"ro\" \{\} +, went over all results, and wasn't able to find English content. Maybe I'm missing something?

@matyaskopp
Collaborator Author

Can you please provide an example?

I ran find -type f -name *.xml -exec grep --color=auto -i -nH --null -e lang\=\"ro\" \{\} +, went over all results, and wasn't able to find English content. Maybe I'm missing something?

Oh, sorry - your <teiCorpus> is in English context:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en" xml:id="ParlaMint-RO">

This is the only corpus that has it. I implicitly expected that it has xml:lang="ro"

To check the language context of <term> elements, I used:

java -cp /usr/share/java/saxon.jar net.sf.saxon.Query -xi:off \!method=adaptive -qs:'//*[name()="term" and ./ancestor::*[@xml:lang][1]/@xml:lang="ro"]' -s:ParlaMint-RO/ParlaMint-RO.xml
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Unități geo-politice sau administrative</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură națională</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Organizație politică</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camere</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Parlament bicameral</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Senat</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camera deputaților</term>

The majority language in teiCorpus is usually English, so your value is correct according to the documentation:

@xml:lang is also a global attribute and gives the language code of the text content of the element; for the corpus root this does not (just) mean the content of its TEI header, but primarily the textual content of its XIncluded components. The convention is that language of the text content of an element is determined by the value of the first @xml:lang attribute on its ancestor axis. In cases where the content is multilingual, the language code should be of the majority language. When the proportion of the languages is about equal, then the mul code for multiple languages can also be used.
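The inheritance rule quoted above can be sketched without XPath. The following is an illustration only, using the standard `xml.etree.ElementTree`; the function name `terms_with_language` is hypothetical. Unlike the Saxon query shown earlier, it also honours an `xml:lang` set on the `<term>` element itself, which is the usual XML behaviour but an assumption here.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def terms_with_language(element, inherited=None):
    """Yield (language, text) for every TEI <term>, resolving inherited xml:lang."""
    # An element's own xml:lang wins; otherwise the nearest ancestor's value applies.
    lang = element.get(XML_LANG, inherited)
    if element.tag == TEI_NS + "term":
        yield lang, (element.text or "").strip()
    for child in element:
        yield from terms_with_language(child, lang)
```

Filtering the result for `lang == "ro"` gives the terms that are read as Romanian.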

but the common practice is to use the corpus language.

@TomazErjavec Can English be preserved in teiCorpus here?

@RePierre
Collaborator

RePierre commented Mar 28, 2023

Normalized setting element in corpus root file and component files and set corpus span with commit d343920.

Should resolve:

setting element in root file
corpus timespan setting

@TomazErjavec
Collaborator

@TomazErjavec Can English be preserved in teiCorpus here?

In practice I'd much rather not have an exception. So, teiCorpus and TEI should have @xml:lang="ro".
But maybe teiHeader with @xml:lang="en" is legit?

@RePierre
Collaborator

Changed language of the teiCorpus element in commit 548e357.

@matyaskopp
Collaborator Author

matyaskopp commented Mar 29, 2023

Duplicate person

  • duplicate person

Every person should have one record in listPerson:
https://github.com/romanian-parlamint/ParlaMint/blob/548e3576054c9067aee43fb2275b879cac9ba806/Data/ParlaMint-RO/ParlaMint-RO.xml#L1306-L1324

          <person xml:id="Augustin-Lucian-Bolcas">
            <persName>
              <forename>Lucian</forename>
              <forename>Augustin</forename>
              <surname>Bolcaș</surname>
            </persName>
            <sex value="M"/>
            <affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
          </person>
          <person xml:id="Lucian-Augustin-Bolcas">
            <persName>
              <forename>Lucian</forename>
              <forename>Augustin</forename>
              <surname>Bolcaș</surname>
            </persName>
            <sex value="M"/>
            <affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
            <affiliation ana="#RoParl.52" ref="#RoParl" role="member" from="2004-12-19" to="2008-12-13"/>
          </person>
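Duplicates like the one above can be listed by grouping `<person>` records on their name parts. This is a rough sketch with the standard `xml.etree.ElementTree`; the function name `duplicate_persons` is hypothetical, and matching on identical forename and surname sequences is an assumption (genuine namesakes and placeholder names would be reported too, so the output needs manual review).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_persons(root):
    """Map each repeated (forenames, surnames) pair to the xml:id values sharing it."""
    ids_by_name = defaultdict(list)
    for person in root.iter(TEI + "person"):
        pers_name = person.find(TEI + "persName")
        if pers_name is None:
            continue
        forenames = tuple(e.text for e in pers_name.findall(TEI + "forename"))
        surnames = tuple(e.text for e in pers_name.findall(TEI + "surname"))
        ids_by_name[(forenames, surnames)].append(person.get(XML_ID))
    # Keep only names that occur under more than one record.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```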

Necunoscut Necunoscut ("unknown") as a person's name

  • Necunoscut Necunoscut

First occurrence:
https://github.com/romanian-parlamint/ParlaMint/blob/548e3576054c9067aee43fb2275b879cac9ba806/Data/ParlaMint-RO/ParlaMint-RO.xml#L6030

          <person xml:id="Dan-Dumitrescu">
            <persName>
              <forename>Necunoscut</forename>
              <surname>Necunoscut</surname>
            </persName>
            <sex value="U"/>
            <affiliation ana="#RoParl.55" ref="#RoParl" role="member" from="2016-12-21" to="2020-12-20"/>
          </person>

@RePierre
Collaborator

Missing speech content

As suggested by @TomazErjavec, added <gap> elements to the utterances without segments in commit 0082dd3.

@RePierre
Collaborator

Duplicate person

Fixed duplicate person with commit ac9a2bc.

@RePierre
Collaborator

RePierre commented Apr 11, 2023

corpus timespan bibl

Included corpus timespan in <bibl> element with commit 70b7fc2.

@RePierre
Collaborator

corpus timespan it would be nice to have it in text content of corpus title too

Included corpus span in corpus subtitle with commit df3879b.

@RePierre
Collaborator

presence list is missing status

As discussed in the meeting on April 12, we cannot provide the presence list in time for this version because this requires changes in the crawlers of the session transcripts. I will try to include this data into a future version of the corpus.

@RePierre
Collaborator

extend meeting elements (#parla.term, #parla.sitting)

Extended meeting elements with term and sitting information with commit 75affa9.

@matyaskopp
Collaborator Author

matyaskopp commented May 24, 2023

  • include annotated component files

Error: /home/runner/work/ParlaMint/ParlaMint/ParlaMint/Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.xml:132:189: error: text not allowed here; expected element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"

@RePierre, you include unannotated files (TEI) in the annotated (TEI.ana) root file:
https://github.com/romanian-parlamint/ParlaMint/blob/459b829a1e053df1e22502222324d246be1c9a47/Data/ParlaMint-RO/ParlaMint-RO.ana.xml#L3018-L3027
e.g.

<xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.xml"/>

should be

<xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.ana.xml"/>
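A quick consistency check for the annotated root file might look like the sketch below. It assumes that component files are the includes whose `href` contains an underscore followed by an ISO date, as in the file names in this thread, so that other includes are ignored; the function name `unannotated_includes` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
# Component files are assumed to carry "_YYYY-MM-DD" in their name.
COMPONENT_RE = re.compile(r"_\d{4}-\d{2}-\d{2}")


def unannotated_includes(root):
    """Return hrefs of component includes that do not end in '.ana.xml'."""
    hrefs = (inc.get("href", "") for inc in root.iter(XI_INCLUDE))
    return [h for h in hrefs if COMPONENT_RE.search(h) and not h.endswith(".ana.xml")]
```

Because the lookup is by namespace URI, it works regardless of the prefix used for XInclude in the file.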

@RePierre
Collaborator

include annotated component files

Included proper component files in commit 90da93b.

@matyaskopp
Collaborator Author

@RePierre, thanks for the progress.

I have spotted an issue in the TEI.ana version of the files:

wrongly placed notes in the TEI.ana version

  • notes are placed at the beginning of seg
  • unannotated text after the first note

Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.ana.xml:6433:284: error: text not allowed here; expected the element end-tag or element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"

TEI: (https://github.com/romanian-parlamint/ParlaMint/blob/5f986e2cc79e3f28347c6a655416c7f4f4d57a1c/Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.xml#L284)

<seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8">Cred <!--
... 
--> salariile. <vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
--> toţi.</seg>

TEI.ana:

<seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8"><vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
-->toţi.<s xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1">
  <w xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1.1" lemma="Cred" pos="Vmip1s" msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin">Cred</w>
<!--... -->
</s>
<!--... -->
</seg>

@matyaskopp
Collaborator Author

Unrecognized full-paragraph note

  • "full-paragraph" notes

https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml#L422-L424

  <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
  <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg9">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</seg>
</u>

should be:

  <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
</u>
<note type="narrative">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</note>

Other occurrences in sample data:

DataForks/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml:411:          <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u32.seg4">(Domnul Valeriu Ştefan Zgonea părăseşte prezidiul şi se îndreaptă spre tribună.)</seg>
DataForks/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml:423:          <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg9">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</seg>
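Candidates of this kind can be found mechanically. The sketch below flags `<seg>` elements whose whole content is a single parenthesized remark; it is a heuristic under that assumption, and the function name `full_paragraph_notes` is hypothetical. Moving the flagged segments out of `<u>` into `<note>` elements is left out.

```python
import re
import xml.etree.ElementTree as ET

# Exactly one "(...)" group with no nested or additional parentheses.
PAREN_ONLY = re.compile(r"^\s*\([^()]*\)\s*$")


def full_paragraph_notes(u):
    """Return the texts of <seg> children that consist of one parenthesized remark."""
    found = []
    for seg in u:
        is_seg = seg.tag.rsplit("}", 1)[-1] == "seg"
        if not is_seg or len(seg) > 0:
            continue  # skip other children and segs that contain elements
        if PAREN_ONLY.match(seg.text or ""):
            found.append(seg.text.strip())
    return found
```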

@matyaskopp
Collaborator Author

U+0096 (SPA) Unicode Character

  • remove <0x0096> character

This character is allowed in ParlaMint, but it causes problems in linguistic annotation, so I suggest removing it from the text: https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.xml#L148

<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5">După <!--
...
--> urgie � 1940. Dar n-a fost să fie aşa.</seg>
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.29" lemma="" pos="Ncm--n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc">�</w>
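A cleanup step for this could be sketched as follows. U+0096 belongs to the C1 control range (U+0080 to U+009F), so the sketch treats the whole range alike, which is an assumption beyond the single character reported here; the function names are hypothetical. In Windows-1252 the byte 0x96 encodes an en dash, so the character may be a mis-decoded dash, but that is only a guess about its origin.

```python
import re

# C1 control characters occupy U+0080..U+009F; U+0096 is one of them.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def find_c1_controls(text):
    """Return (offset, code point) pairs for every C1 control character."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in C1_CONTROLS.finditer(text)]


def strip_c1_controls(text):
    """Remove C1 controls, then collapse runs of spaces a removal may leave behind."""
    return re.sub(r" {2,}", " ", C1_CONTROLS.sub("", text))
```

Collapsing spaces affects the whole string, so a stricter variant may be preferable if double spaces are meaningful elsewhere.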

@matyaskopp
Collaborator Author

Named entities

  • named entities contain non-proper names

I guess you are using a model that labels not only named entities from the PER/LOC/ORG/MISC set but also DATE and probably other labels. Something like this: https://huggingface.co/dumitrescustefan/bert-base-romanian-ner
And you map all non-proper names to the MISC category, e.g.:

<name type="MISC">
  <w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.23" lemma="acel" pos="Dd3msr---e" msd="UPosTag=DET|Case=Acc,Nom|Gender=Masc|Number=Sing|Person=3|Position=Prenom|PronType=Dem">acel</w>
  <w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.24" lemma="an" pos="Ncms-n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc|Number=Sing">an</w>
</name>

or

<name type="MISC">
  <w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.30" lemma="1940" pos="Mc-s-d" msd="UPosTag=">1940</w>
</name>

The year 1940 is not a proper name, so it shouldn't be wrapped in <name>; <date> is a better fit.
There are two options to solve this:

  1. remove named entities that are not proper names (DATETIME, PERIOD, MONEY, QUANTITY, ...)
  2. find inspiration in the CZ corpus and use the proper tags. See mapping: update named-entity elements ufal/ParCzech#95 (comment)
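Option 1 could be applied before the labels are collapsed into MISC. The sketch below assumes the tagger output is available as a list of dictionaries with a `label` key, which is a hypothetical format and not the actual pipeline's; only the four labels named above are dropped, and any further labels would have to be added after checking the model's label set.

```python
# Labels named in this issue as not being proper names.
NON_PROPER_LABELS = {"DATETIME", "PERIOD", "MONEY", "QUANTITY"}


def keep_proper_names(entities):
    """Drop entities whose model label is listed in NON_PROPER_LABELS."""
    return [e for e in entities if e["label"] not in NON_PROPER_LABELS]


# Hypothetical tagger output for illustration only.
sample = [
    {"label": "DATETIME", "text": "1940"},
    {"label": "PERSON", "text": "Eminescu"},
]
print(keep_proper_names(sample))
```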

We are under time pressure, so I suggest using option (1) for ParlaMint 3.0; you can possibly improve it in ParlaMint 3.1 (create an RO-specific taxonomy, use the proper elements, and add the ana attribute).
@TomazErjavec ??

@matyaskopp
Collaborator Author

shifted NEs ?

  • shifted NEs

In this paragraph (ParlaMint-RO_2000-10-24-id4980.u2.seg8.2), NEs seem to be shifted.
https://raw.githubusercontent.com/clarin-eric/ParlaMint/3f2d0a820d31aa7e55b72156089a3450b303e3bc/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.ana.xml
Reformatted, with token elements (w and pc) removed:

<s xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg8.2">
atitudinea autorităţilor ucrainene faţă de delegaţiile judeţului Suceava şi
<name type="MISC">Botoşani</name>
, la festivitatea dezvelirii
<name type="LOC">statuii</name>
lui
<name type="LOC">Eminescu</name>
, la Cernăuţi, în ziua de 15 iunie
<name type="LOC">2000</name>
; constrângerile
<name type="MISC">aduse în şcolile româneşti;</name>
coborârea unicului steag românesc de
<name type="MISC">pe</name>
clădirea sediului
<name type="LOC">redacţiei ziarului"</name>
Lumea"
<name type="MISC">;</name>
prezenţa la
<name type="MISC">manifestările româneşti a unor</name>
reprezentanţi gălăgioşi ai organizaţiilor
<name type="MISC">extremiste</name>
ucrainene; oprirea tinerilor etnici români,
<name type="MISC">în</name>
număr de
<name type="PER">200, de</name>
a veni la studii
<name type="MISC">în</name>
România, cu burse din partea statului
<name type="LOC">român</name>
şi altele.
</s>

@matyaskopp
Collaborator Author

matyaskopp commented May 26, 2023

Voci din sală: in utterance

  • voice from the hall

https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.xml#L408-L414

<note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg3">Voci din sală:</seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg4">S-a terminat de mult!</seg>
</u>

should be:

<note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
</u>
<note type="speaker">Voci din sală:</note>
<!-- no who attribute, ana is regular - expecting MP interrupting -->
<u ana="#regular" xml:id="ParlaMint-RO_2000-10-24-id4980.u38">
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u38.seg1">S-a terminat de mult!</seg>
</u>

@matyaskopp
Collaborator Author

person - affiliation - organization

  • parliamentary groups
  • only one virtual parliamentary group <orgName xml:lang="en" full="yes">Placeholder parliamentary group</orgName>
  • government

I guess you are aware of this; I just wanted it to be recorded:

  INFO[10]  Total number of affiliations with RoParl: 256
  INFO[10]  Total number of affiliations with RoGov: 0
  Error: ERROR[10]  government-role organisation without affiliation: #RoGov
  INFO[10]  Total number of affiliations with RoParl.All: 0
  WARN[10]  parliamentaryGroup-role organisation without affiliation: #RoParl.All
  INFO[12]  Total number of organizations with parliament role: 1
  INFO[12]  Total number of organizations with government role: 1
  INFO[12]  Total number of organizations with parliamentaryGroup role: 1
  INFO[??]  Total number of affiliations 256
  INFO[??]  Total number of NO-role affiliations 0
  INFO[??]  Total number of 'member' role affiliations 256
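Counts like those in the log can be reproduced roughly as below. This is not the ParlaMint validation script, just a sketch with the standard `xml.etree.ElementTree`; the function name `affiliation_counts` is hypothetical, and it assumes that `ref` holds a single `#id` pointer.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def affiliation_counts(root):
    """Count <affiliation> elements per organisation id, keeping zero counts."""
    # Start every known organisation at zero so unreferenced ones stay visible.
    counts = Counter({org.get(XML_ID): 0 for org in root.iter(TEI + "org")})
    for aff in root.iter(TEI + "affiliation"):
        counts[aff.get("ref", "").lstrip("#")] += 1
    return dict(counts)
```

Organisations that map to zero are the ones the log reports as being without affiliation.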

@RePierre
Collaborator

wrongly placed notes in the TEI.ana version

Fixed with commit 6662ec4.

@RePierre
Collaborator

remove <0x0096> character

Removed in commit 69a116e.

@matyaskopp
Collaborator Author

strange UPosTag _ when Mc-s-d

  • UPosTag of digit tokens Mc-s-d

Every token with pos="Mc-s-d" has the wrong msd="UPosTag=_".
sample:

<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.2" 
   lemma="1990"
   pos="Mc-s-d"
   msd="UPosTag=_">1990</w>

You can fix this with msd="UPosTag=NUM" or msd="UPosTag=NUM|NumForm=Digit"

<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.2" 
   lemma="1990"
   pos="Mc-s-d"
   msd="UPosTag=NUM|NumForm=Digit">1990</w>

strange UPosTag _ when Mc-s-b

  • UPosTag of digit tokens Mc-s-b

Here I suggest replacing _ with X

cat DataForks/ParlaMint-RO/ParlaMint-RO_*.ana.xml| grep 'UPosTag=_"' | grep -v 'pos="Mc.s.d"'

<w xml:id="ParlaMint-RO_2006-09-18-id6154.u31.seg3.1.73" lemma="29,4" pos="Mc-s-b" msd="UPosTag=_">29,4</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u31.seg7.1.14" lemma="29,4" pos="Mc-s-b" msd="UPosTag=_">29,4</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u76.seg2.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u136.seg18.1.2" lemma="31.III.2006" pos="Mc-s-b" msd="UPosTag=_">31.III.2006</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u153.seg5.1.52" lemma="Secuiesc" pos="Mc-s-b" msd="UPosTag=_">Secuiesc</w>
<w xml:id="ParlaMint-RO_2015-09-29-id7560.u60.seg7.1.18" lemma="207;voturi" pos="Mc-s-b" msd="UPosTag=_">207;voturi</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u48.seg9.1.12" lemma="2003/88" pos="Mc-s-b" msd="UPosTag=_">2003/88</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u96.seg2.2.15" lemma="2002/772" pos="Mc-s-b" msd="UPosTag=_">2002/772</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u156.seg16.1.25" lemma="2007-2013" pos="Mc-s-b" msd="UPosTag=_">2007-2013</w>
<w xml:id="ParlaMint-RO_2018-03-05-id7900.u7.seg11.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2018-03-05-id7900.u45.seg8.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u70.seg2.1.34" lemma="30.06.2021" pos="Mc-s-b" msd="UPosTag=_">30.06.2021</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u91.seg2.1.36" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg2.1.40" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg3.1.7" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg6.1.6" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg6.1.47" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg12.1.7" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u118.seg6.1.41" lemma="27.548" pos="Mc-s-b" msd="UPosTag=_">27.548</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u126.seg4.1.30" lemma="1.579/2006" pos="Mc-s-b" msd="UPosTag=_">1.579/2006</w>
<w xml:id="ParlaMint-RO_2021-11-09-id8341.u96.seg3.2.49" lemma="1,5°C" pos="Mc-s-b" msd="UPosTag=_">1,5°C</w>
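Both replacements suggested in this comment can be applied in one pass over the tokens. The following is a sketch with the standard `xml.etree.ElementTree`; the function name `fix_underscore_upos` is hypothetical, and it only touches tokens whose `msd` is exactly `UPosTag=_`.

```python
import xml.etree.ElementTree as ET

# Replacement values follow the suggestions made in this comment.
MSD_FIXES = {
    "Mc-s-d": "UPosTag=NUM|NumForm=Digit",
    "Mc-s-b": "UPosTag=X",
}


def fix_underscore_upos(root):
    """Rewrite msd="UPosTag=_" on <w> tokens in place; return the number changed."""
    changed = 0
    for w in root.iter():
        if w.tag.rsplit("}", 1)[-1] != "w":
            continue  # only word tokens are considered
        if w.get("msd") == "UPosTag=_" and w.get("pos") in MSD_FIXES:
            w.set("msd", MSD_FIXES[w.get("pos")])
            changed += 1
    return changed
```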

@matyaskopp
Collaborator Author

No join attribute

  • join="right" is missing in TEI.ana

see documentation: https://clarin-eric.github.io/ParlaMint/#sec-ana-words
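As a rough illustration, the information needed for the attribute can be recovered by aligning tokens with the original segment text. The sketch below assumes that `join="right"` marks a token that is not separated by whitespace from the token after it, which should be confirmed against the linked documentation; the function name `join_flags` is hypothetical.

```python
def join_flags(text, tokens):
    """For each token, report whether the next token follows it without whitespace."""
    flags = []
    pos = 0
    for i, token in enumerate(tokens):
        start = text.index(token, pos)  # raises ValueError if tokens do not match text
        pos = start + len(token)
        is_last = i == len(tokens) - 1
        glued = not is_last and pos < len(text) and not text[pos].isspace()
        flags.append(glued)
    return flags


print(join_flags("Cine este pentru?", ["Cine", "este", "pentru", "?"]))
# [False, False, True, False]
```

Tokens flagged `True` would be the ones to receive the attribute.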

@TomazErjavec
Collaborator

As RO won't be a part of 3.1, moving this to "future" milestone.


3 participants