Can I disable text validation? #26

Open
stelmath opened this issue Jul 15, 2023 · 4 comments

Comments

@stelmath

stelmath commented Jul 15, 2023

Hello and thanks for the repo,

I think I have an issue with text validation. I created a FoLiA document with sentence elements and their text, machine translated it, and read the machine-translated XML back in as a new FoLiA document. No issue so far. Then I create a third FoLiA document that has div elements, and inside each div I append a sentence pair: one sentence from the original document and one from the machine-translated document. But when I try to append the sentence from the machine-translated document, I get:

folia.main.InconsistentText: Text for <Sentence at 140523773674208 id=source_segment_EN_7 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
[ date ]
****> BUT FOUND (strict text after normalization) ****>
[date]
******* DEVIATION POINT: [<*HERE*>date]
(also checked against older rules prior to FoLiA v2.4.1)

Basically my sentence is :

<s xml:id="source_segment_EN_7">
      <t>
        <t-style class="bold">[date]</t-style>
      </t>
</s>

which results in the above error. I don't understand what the terms deep text and strict text mean, but can I simply turn off this validation and let this sentence be appended to a div? Thanks

@proycon
Owner

proycon commented Jul 17, 2023

Text validation is a rather essential component that ensures your FoLiA document is valid, so it is always enabled.

I wonder what causes your error, do you also have a <t> element under your
<div>? The error suggests that the text there is [ date ] (with spaces). If
text is specified multiple times on multiple levels (text redundancy), it can't
be different on a deeper level than on the higher level; text validation

@stelmath
Author

stelmath commented Jul 17, 2023

The <div> into which I want to insert the above <s> is this:

<div xml:id="segment_id_7">
      <s xml:id="source_segment_7">
        <t>
          <t-style class="bold">[datum]</t-style>
        </t>
      </s>
</div>

Essentially, I am looking to pair up source/translation sentences and their tokens/POS tags inside another element. That element could be a <div> or something else, like a <p>, I don't really mind. Is there a better way to do this?

The [ date ] text mentioned in the error can't be found anywhere in my FoLiA document:

<?xml version='1.0' encoding='utf-8'?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="task_id_1" version="2.5.1" generator="foliapy-v2.5.8">
  <metadata type="native">
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <pos-annotation set="https://web.archive.org/web/20190206204307/https://www.clips.uantwerpen.be/pages/mbsp-tags">
        <annotator processor="proc.spacy.0ed0c64d"/>
      </pos-annotation>
      <style-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/styles.foliaset.xml">
        <annotator processor="proc.internal_formatting.0124964d"/>
      </style-annotation>
      <spanrelation-annotation>
        <annotator processor="proc.word_alignment.64f4b459"/>
      </spanrelation-annotation>
      <division-annotation/>
    </annotations>
    <provenance>
      <processor xml:id="proc.spacy.0ed0c64d" name="spacy" type="auto">
        <processor xml:id="proc.spacy.0ed0c64d.generator" name="foliapy" type="generator" version="2.5.8" folia_version="2.5.1"/>
      </processor>
      <processor xml:id="proc.internal_formatting.0124964d" name="internal_formatting" type="auto">
        <processor xml:id="proc.internal_formatting.0124964d.generator" name="foliapy" type="generator" version="2.5.8" folia_version="2.5.1"/>
      </processor>
      <processor xml:id="proc.word_alignment.64f4b459" name="word_alignment" type="auto">
        <processor xml:id="proc.word_alignment.64f4b459.generator" name="foliapy" type="generator" version="2.5.8" folia_version="2.5.1"/>
      </processor>
    </provenance>
  </metadata>
  <text xml:id="task_id_1.text.1">
    <div xml:id="segment_id_1">
      <s xml:id="source_segment_1">
        <t>CONCEPT</t>
      </s>
      <s xml:id="source_segment_EN_1">
        <t>CONCEPT</t>
        <w xml:id="segment_id_1.w.1">
          <t>CONCEPT</t>
          <pos class="NN"/>
        </w>
      </s>
    </div>
    <div xml:id="segment_id_2">
      <s xml:id="source_segment_2">
        <t>Ministerie van Infrastructuur en Waterstaat</t>
      </s>
      <s xml:id="source_segment_EN_2">
        <t>Ministry of Infrastructure and Water Management</t>
        <w xml:id="segment_id_2.w.1">
          <t>Ministry</t>
          <pos class="NNP"/>
        </w>
        <w xml:id="segment_id_2.w.2">
          <t>of</t>
          <pos class="IN"/>
        </w>
        <w xml:id="segment_id_2.w.3">
          <t>Infrastructure</t>
          <pos class="NNP"/>
        </w>
        <w xml:id="segment_id_2.w.4">
          <t>and</t>
          <pos class="CC"/>
        </w>
        <w xml:id="segment_id_2.w.5">
          <t>Water</t>
          <pos class="NNP"/>
        </w>
        <w xml:id="segment_id_2.w.6">
          <t>Management</t>
          <pos class="NNP"/>
        </w>
      </s>
    </div>
    <div xml:id="segment_id_3">
      <s xml:id="source_segment_3">
        <t>Pagina  van </t>
      </s>
      <s xml:id="source_segment_EN_3">
        <t>Page of </t>
        <w xml:id="segment_id_3.w.1">
          <t>Page</t>
          <pos class="NN"/>
        </w>
        <w xml:id="segment_id_3.w.2">
          <t>of</t>
          <pos class="IN"/>
        </w>
      </s>
    </div>
    <div xml:id="segment_id_4">
      <s xml:id="source_segment_4">
        <t>Ministerie van Infrastructuur en Waterstaat</t>
      </s>
      <s xml:id="source_segment_EN_4">
        <t>Ministry of Infrastructure and Water Management</t>
        <w xml:id="segment_id_4.w.1">
          <t>Ministry</t>
          <pos class="NNP"/>
        </w>
        <w xml:id="segment_id_4.w.2">
          <t>of</t>
          <pos class="IN"/>
        </w>
        <w xml:id="segment_id_4.w.3">
          <t>Infrastructure</t>
          <pos class="NNP"/>
        </w>
        <w xml:id="segment_id_4.w.4">
          <t>and</t>
          <pos class="CC"/>
        </w>
        <w xml:id="segment_id_4.w.5">
          <t>Water</t>
          <pos class="NNP"/>
        </w>
        <w xml:id="segment_id_4.w.6">
          <t>Management</t>
          <pos class="NNP"/>
        </w>
      </s>
    </div>
    <div xml:id="segment_id_5">
      <s xml:id="source_segment_5">
        <t>HOOFDDIRECTIE  BESTUURLIJKE EN JURIDISCHE ZAKEN</t>
      </s>
      <s xml:id="source_segment_EN_5">
        <t>MAIN DIRECTORATE OF ADMINISTRATIVE AND LEGAL AFFAIRS</t>
        <w xml:id="segment_id_5.w.1">
          <t>MAIN</t>
          <pos class="NNP"/>
        </w>
        <w xml:id="segment_id_5.w.2">
          <t>DIRECTORATE</t>
          <pos class="NN"/>
        </w>
        <w xml:id="segment_id_5.w.3">
          <t>OF</t>
          <pos class="IN"/>
        </w>
        <w xml:id="segment_id_5.w.4">
          <t>ADMINISTRATIVE</t>
          <pos class="JJ"/>
        </w>
        <w xml:id="segment_id_5.w.5">
          <t>AND</t>
          <pos class="CC"/>
        </w>
        <w xml:id="segment_id_5.w.6">
          <t>LEGAL</t>
          <pos class="JJ"/>
        </w>
        <w xml:id="segment_id_5.w.7">
          <t>AFFAIRS</t>
          <pos class="NNS"/>
        </w>
      </s>
    </div>
    <div xml:id="segment_id_6">
      <s xml:id="source_segment_6">
        <t>Pagina  van </t>
      </s>
      <s xml:id="source_segment_EN_6">
        <t>Page of </t>
        <w xml:id="segment_id_6.w.1">
          <t>Page</t>
          <pos class="NN"/>
        </w>
        <w xml:id="segment_id_6.w.2">
          <t>of</t>
          <pos class="IN"/>
        </w>
      </s>
    </div>
    <div xml:id="segment_id_7">
      <s xml:id="source_segment_7">
        <t>
          <t-style class="bold">[datum]</t-style>
        </t>
      </s>
    </div>
  </text>
</FoLiA>

@stelmath
Author

To make it simpler, I have isolated the sentence that causes the issue. Here is a debug.xml document:

<?xml version='1.0' encoding='utf-8'?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="task_id_1" version="2.5.1" generator="foliapy-v2.5.8">
  <metadata type="native">
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <pos-annotation set="https://web.archive.org/web/20190206204307/https://www.clips.uantwerpen.be/pages/mbsp-tags">
        <annotator processor="proc.spacy.d01e0445"/>
      </pos-annotation>
      <style-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/styles.foliaset.xml">
        <annotator processor="proc.internal_formatting.452d069a"/>
      </style-annotation>
      <spanrelation-annotation>
        <annotator processor="proc.word_alignment.2ff3e2c3"/>
      </spanrelation-annotation>
      <sentence-annotation/>
      <token-annotation/>
    </annotations>
    <provenance>
      <processor xml:id="proc.spacy.d01e0445" name="spacy" type="auto">
        <processor xml:id="proc.spacy.d01e0445.generator" name="foliapy" type="generator" version="2.5.8" folia_version="2.5.1"/>
      </processor>
      <processor xml:id="proc.internal_formatting.452d069a" name="internal_formatting" type="auto">
        <processor xml:id="proc.internal_formatting.452d069a.generator" name="foliapy" type="generator" version="2.5.8" folia_version="2.5.1"/>
      </processor>
      <processor xml:id="proc.word_alignment.2ff3e2c3" name="word_alignment" type="auto">
        <processor xml:id="proc.word_alignment.2ff3e2c3.generator" name="foliapy" type="generator" version="2.5.8" folia_version="2.5.1"/>
      </processor>
    </provenance>
  </metadata>
  <text xml:id="task_id_1.text.1">
  <s xml:id="segment_id_7">
      <t>
        <t-style class="bold">[date]</t-style>
      </t>
      <w xml:id="segment_id_7.w.1">
        <t>[</t>
        <pos class="XX"/>
      </w>
      <w xml:id="segment_id_7.w.2">
        <t>date</t>
        <pos class="XX"/>
      </w>
      <w xml:id="segment_id_7.w.3">
        <t>]</t>
        <pos class="XX"/>
      </w>
    </s>
  </text>
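</FoLiA>

import folia.main as folia

# Loading the file runs text validation on the sentence above.
# The variable is named doc so it does not shadow the imported folia module.
doc = folia.Document(file='debug.xml')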

@kosloot
Collaborator

kosloot commented Jul 29, 2023

There are 2 possible solutions here:

  1. Fix the words by adding the space="no" attribute to words 1 and 2.
  2. Fix the sentence by adding a space after the [ and after date.

So either:

      <w xml:id="segment_id_7.w.1" space="no">
        <t>[</t>
        <pos class="XX"/>
      </w>
      <w xml:id="segment_id_7.w.2" space="no">
        <t>date</t>
        <pos class="XX"/>
      </w>
      <w xml:id="segment_id_7.w.3">
        <t>]</t>
        <pos class="XX"/>
      </w>

or:

      <t>
        <t-style class="bold">[ date ]</t-style>
      </t>

The point is that a space is implicit after EVERY word, unless that word carries space="no".
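
To make the two texts in the error message concrete, here is a rough sketch of the comparison using only Python's standard library. It is not foliapy's actual implementation, and the helper names strict_text and deep_text are hypothetical (borrowed from the wording of the error message); the real validator applies more normalization than this. The assumption is that the "strict" text is what the sentence's own <t> states, while the "deep" text is rebuilt from the <w> children with a space after each word unless it has space="no".

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"  # FoLiA namespace, as declared in debug.xml

# Reduced copy of the failing sentence (ids and pos tags left out for brevity).
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia">
  <t><t-style class="bold">[date]</t-style></t>
  <w><t>[</t></w>
  <w><t>date</t></w>
  <w><t>]</t></w>
</s>
"""

def strict_text(s):
    """Hypothetical helper: the text stated on the sentence's own <t>."""
    return "".join(s.find(NS + "t").itertext()).strip()

def deep_text(s):
    """Hypothetical helper: text rebuilt from the <w> children.

    A space is appended after every word unless it has space="no".
    """
    parts = []
    for w in s.findall(NS + "w"):
        parts.append("".join(w.find(NS + "t").itertext()).strip())
        if w.get("space", "yes") != "no":
            parts.append(" ")
    return "".join(parts).strip()

sentence = ET.fromstring(SENTENCE.strip())
before = (strict_text(sentence), deep_text(sentence))

# First suggested fix: mark words 1 and 2 as not followed by a space.
for w in sentence.findall(NS + "w")[:2]:
    w.set("space", "no")
after = (strict_text(sentence), deep_text(sentence))

print(before)  # strict and deep text differ
print(after)   # strict and deep text agree
```

Before the fix the two values are "[date]" and "[ date ]", which is the mismatch reported in the error; after setting space="no" on the first two words both come out as "[date]".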
