Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

text inside <t-hbr> nodes is allowed, but problematic #25

Open
kosloot opened this issue Jan 16, 2023 · 8 comments
Open

text inside <t-hbr> nodes is allowed, but problematic #25

kosloot opened this issue Jan 16, 2023 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@kosloot
Copy link
Collaborator

kosloot commented Jan 16, 2023

Given this example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="hyp" generator="libfolia-v2.13" version="2.5.1">
  <metadata type="native">
    <annotations>
      <hyphenation-annotation set="tst"/>
      <style-annotation set="tst"/>
      <paragraph-annotation set="tst"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="hyp.text">
    <p xml:id="hyp.p.1">
      <t class="FoLiA-txt">Appel<t-hbr>peren</t-hbr>taart</t>
    </p>
  </text>
</FoLiA>

folia2txt produces:
Appeltaart
Which is maybe consistent with the docs , although it is not explicitly forbidden
(should isn't may not)
FoLiA-2text gives:
Appelperentaart, so does interpret the embedded text.

My problem with this is, dat NOT including the text violates the principle of least surprise. By just looking in the text of the <p> you might expect the peren to show up.

The Best solution imho is to explicitly forbid a text content inside a <t-hbr> an give an error when it is attempted.

@proycon
Copy link
Owner

proycon commented Jan 16, 2023

I think the idea was that could contain the actual hyphenation symbol that was used in the text, reflecting that it is an actual symbol appearing in the text. You'd need it again if you'd want to serialize the text exactly as it was. If we'd forbid it entirely as you suggest (which makes things simpler indeed), we'd need to assume there is one generic hyphenation symbol (e.g. a hyphen) that we can use in case the user wants the 'original' text exactly.

By default, foliapy indeed resolves the <t-hbr>, I don't think there's currently an implementation that reconstructs the data including hyphens, but at least it's a theoretical possibility.

I'm also a bit hesistant to simplify this now after the fact as it'd break backward compatibility (I don't whether there a are a lot of documents out there using this, probably not, but can't be sure).

@kosloot
Copy link
Collaborator Author

kosloot commented Jan 16, 2023

I think the idea was that could contain the actual hyphenation symbol that was used in the text, reflecting that it is an actual symbol appearing in the text.

Which was exactly what I tried for FoLiA-abby, leading to this discussion.
As we already worked out, the hyphenation symbol can better be handled in the class feature.
Like: <t-hbr class="¬"/>

By default, foliapy indeed resolves the <t-hbr>,

Where `resolves' implies, IGNORING completely.
FoliaPY doesn't implement a feature to include it. And libfolia lacks a way of expressing to ignore
I suggest amending libfolia to ignore the text value (and maybe give a warning?) and leave a mechanism to include the content as a wish for both foliaPY and libfolia. For whenever the need comes up.

kosloot added a commit to LanguageMachines/libfolia that referenced this issue Jan 17, 2023
@proycon
Copy link
Owner

proycon commented Jan 17, 2023

Looks good yes!

@kosloot
Copy link
Collaborator Author

kosloot commented Jan 30, 2023

In hindsight, I think it is much better to NOT ABUSE the class feature for storing the actual hyphen.
I strongly suggest to put the hyphen as a value in the <t-hbr/>, as I previously did. e.g. <t-hbr>¬</t-hbr>
We then have to make an extra text-retrieving option to include that, otherwise ignored, text.
We need such an option anyway if we want to extract the value from the class.
Such an option is very easy to implement. (at least in libfolia).

That an evil user could use that value to store more than just a hyphen is not really a problem imho

As a bonus, we could allow text inside a <t-hspace>, for instance to store all kind of Unicode space characters, or sequences of more than one space. Those will be filtered out of the "normalized" text anyway.

@proycon
Copy link
Owner

proycon commented Feb 13, 2023

I agree with your assessment. What is in the text should be text, so putting them in the xml text was the good decision, as often these hyphens appear verbatim in the text. Even though by default we may want to ignore this text.

That an evil user could use that value to store more than just a hyphen is not really a problem imho

Agreed

As a bonus, we could allow text inside a , for instance to store all kind of Unicode space characters, or sequences of more than one space. Those will be filtered out of the "normalized" text anyway.

That makes sense yes. It does raise some difficulties due to the way XML deals with whitespace if the element were to be split over multiple lines.

@kosloot
Copy link
Collaborator Author

kosloot commented Feb 13, 2023

Ok, I will stick to that solution then.
``

It does raise some difficulties due to the way XML deals with whitespace if the element were to be split over multiple lines.
``

Well, I was referring to a <t-hspace> , which is documented te be for horizontal spacing. Not <whitespace> which should be used for vertical spacing.
I agree that a malicious user could smuggle linefeed or pagefeed characters into the <t> inside <t-hspace>.
So be it. In "normal use" (including standard text() extraction, we will discard them.
For the specialized text() extraction I made, which tries to restore as much as possible, they will be retrieved "as is",
which should be fine.

I will implement this and create some simple examples for testing

kosloot added a commit to LanguageMachines/libfolia that referenced this issue Feb 14, 2023
kosloot added a commit to LanguageMachines/foliautils that referenced this issue Feb 14, 2023
@kosloot
Copy link
Collaborator Author

kosloot commented Feb 15, 2023

I created a simple example to demonstrate what libfolia is capable of now:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="markup" generator="libfolia-v2.14" version="2.5.1">
  <metadata type="native">
    <annotations>
      <sentence-annotation set="FoLiA-txt-set"/>
      <paragraph-annotation set="FoLiA-txt-set"/>
      <linebreak-annotation set="FoLiA-txt-set"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <hyphenation-annotation set="FoLiA-txt-set"/>
      <hspace-annotation set="FoLiA-txt-set"/>
    </annotations>
  </metadata>
  <text xml:id="markup.text">
    <p xml:id="markup.p.1">
      <s xml:id="markup.p.1.s.1">
        <t>Dit is een test. Met enkele afgebroken zin<t-hbr>-</t-hbr><br space="no"/>nen. Dit is de eerste. Met ook een soft af<t-hbr>¬</t-hbr><br space="no"/>breking.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.2">
        <t>En dit heeft een lange'<t-hspace>       </t-hspace>'space val.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.3">
        <t>En dit heeft een enkele'<t-hspace/>'space val.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.4">
        <t>En dit heeft een speciale'<t-hspace>⍽</t-hspace>'space val.<br/></t>
      </s>
    </p>
  </text>
</FoLiA>

Both folialint and foliavalidator accept this file. OK
Both FoLiA-2text and folia2txt parse this file and produce:

Dit is een test. Met enkele afgebroken zin
nen. Dit is de eerste. Met ook een soft af
breking.
En dit heeft een lange' 'space val.
En dit heeft een enkele' 'space val.
En dit heeft een speciale' 'space val.

NOTE @proycon : folia2text slides in some leading spaces, after the first sentence. A minor bug?
Besides that: OK

BUT FoLiA-2text can now also produce, (using --restore-formatting):

Dit is een test. Met enkele afgebroken zin-
nen. Dit is de eerste. Met ook een soft af¬
breking.
En dit heeft een lange'       'space val.
En dit heeft een enkele' 'space val.
En dit heeft een speciale'⍽'space val.

Which is very cool, I think.

Second NOTE: This example is handcrafted, at the moment there is no DIRECT way to create a FoLiA file with "special" <t-hspace> nodes. The only tool that I have that does this, is FoLiA-abby.

@kosloot
Copy link
Collaborator Author

kosloot commented Feb 15, 2023

@pirolen Please note: We reverted back to the solution before January 17th.
So Abby files produced with the GIT version are not usable (although still valid)
The implementation should now be OK (but needs more testing)
Also FoLiA-2text seems to work OK with the new --restore-formatting option (as shown above)

sorry for the inconvenience.
Hope you can test the Git version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants