Question: Concatenating word parts at soft hyphens #90
Hi and Happy New Year!
Same to you!
I wonder if there is a way to transform FoLiA linebreaks (`<br/>`) into soft breaks with ucto, in case there is a soft hyphen sign `¬` at the end of the line in a FoLiA document.
With ucto itself, no. You'd have to write a script (using foliapy
ideally) that replaces instances of folia.Linebreak with
[folia.Hyphbreak](https://folia.readthedocs.io/en/latest/hyphenation_annotation.html)
whenever it sees a soft hyphen sign immediately preceding it.
Such a script would be usable/generic enough to add to
[foliatools](https://github.com/proycon/foliatools).
If you want the quick and easy way out (sloppy, so not without risk), you could try something like:
```shell
$ sed -e 's|¬<br/>|<t-hbr/>|g' < yourfile.folia.xml
```
You'd then also have to manually insert a declaration for the hyphenation
annotation you inserted otherwise parsers will complain:
```xml
<hyphenation-annotation>
<annotator processor="p1" />
</hyphenation-annotation>
```
And in the provenance section:
```xml
<provenance>
<processor xml:id="p1" name="pirolen" type="manual" />
</provenance>
```
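The quick replacement above can also be sketched in Python, which makes it easy to count how many breaks were converted. This is only a minimal sketch under the same assumptions as the `sed` one-liner: it treats the FoLiA document as plain text rather than parsing it, the function name `replace_soft_hyphen_breaks` and the sample string are hypothetical, and it does not insert the declaration or provenance entries shown above.

```python
import re

# Matches a soft hyphen sign directly followed by a FoLiA line break,
# tolerating optional whitespace inside the tag (<br/> or <br />).
SOFT_HYPHEN_BREAK = re.compile(r"¬<br\s*/>")


def replace_soft_hyphen_breaks(xml_text: str) -> tuple[str, int]:
    """Replace every '¬<br/>' with '<t-hbr/>'; return (new_text, count)."""
    return SOFT_HYPHEN_BREAK.subn("<t-hbr/>", xml_text)


# Hypothetical sample fragment, for illustration only.
sample = "<t>Land¬<br/>wirtschaft und<br/>Forst¬<br />wirtschaft</t>"
converted, count = replace_soft_hyphen_breaks(sample)
print(converted)
print(count)
```

Like the `sed` version, this is sloppy by design: a line break that is not directly preceded by `¬` is left untouched, and anything unusual in the markup (attributes on `<br/>`, whitespace between the sign and the tag) is not handled.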
The goal is to access hyphenated word parts as single tokens.
I'm not sure what the current state of hyphenation support is in ucto
and if they will be recognised as single tokens as you'd indeed expect,
@kosloot probably knows better.
`FoLiA-txt` could already handle the soft breaks accordingly?
It might output a non-spacing break character or strip it altogether? I don't remember well. Best try
and test to be sure.
|
Oops sorry, accidentally closed this probably a bit too prematurely |
I didn't have time to look into this a lot, but I think an important question is also: Maybe this can be improved. |
This is also what I remembered (but could not look it up till now in detail), that FoLiA-abby is able to handle the soft hyphen accordingly, so that the soft-hyphenated words are treated as single tokens by ucto afterwards. In the text example above, the soft hyphens were hacked in by me for testing & illustration (replacing the original normal hyphens), but the txt --> FoLiA converter could not interpret them the same way as FoLiA-abby does (i.e., just put a |
So FoLiA-abby does exactly what you want, and ucto can work with that. Your made-up example will indeed not be handled by FoLiA-2text. So: it would be better NOT to create this kind of FoLiA, neither by hand nor with any tool. That leaves my question: you apparently have FoLiA files with soft hyphens. Which FoLiA tool created those? |
Thank you very much for looking into this. I have plain text files, and used piereling for the quick FoLiA conversion, but would be happy to use |
Ah, OK, it's getting clearer now.
And would like FoLiA-txt to output (fragment):
<p xml:id="hyp.p.1">
  <t class="FoLiA-txt">Der Landwirtschaft</t>
  <str xml:id="hyp.p.1.str.1">
    <t class="FoLiA-txt">Der</t>
  </str>
  <str xml:id="hyp.p.1.str.2">
    <t class="FoLiA-txt">Landwirtschaft</t>
  </str>
</p>
This would be easy... The soft hyphen is discarded. Or do you somehow keep information about the soft hyphen in the FoLiA? |
Ideally, the line break information would be kept via the But I am also fine with the postprocessing script solution, as proycon suggested above. |
Ok,
<p xml:id="hyp.p.1">
  <t class="FoLiA-txt">Der Landwirtschaft</t>
  <str xml:id="hyp.p.1.str.1">
    <t class="FoLiA-txt">Der</t>
  </str>
  <str xml:id="hyp.p.1.str.2">
    <t class="FoLiA-txt">Land<t-hbr>¬</t-hbr></t>
  </str>
  <str xml:id="hyp.p.1.str.3">
    <t class="FoLiA-txt">wirtschaft</t>
  </str>
</p>
FoLiA-2text (and folia2txt) will extract the text:
This is probably OK, but it somehow feels 'odd' that the soft hyphen disappears in the
As a side-note: it would be possible to implement some 'soft-hyphen handling' in ucto, discarding them totally.
A lot to think about after the weekend |
Based on my screenshot, FoLiA-abby replaces |
Yes, that's part of my point. I could teach FoLiA-txt the same trick, as suggested above.
BUT: the discussion is about which text should be present in the At the moment, FoLiA-2text and folia2txt have different opinions about how to handle the second variant. @proycon I discovered that the folia docs here state: |
@proycon @pirolen : At the moment, FoLiA-abby has an |
The presence of the hyphen needs to be kept traceable. If it is unambiguously replaced by the soft hyphen tag followed by the t-style tag, I guess that provides enough provenance information. |
As such,
To summarize:
|
@proycon Just an idea: I could also create nodes like: |
Ah, I didn't realize you introduced it, I thought it was in the specification too, but indeed it isn't. In that case it may be easier to change the behaviour, as only FoLiA-abby output is affected (which probably only affects @pirolen?). Assigning classes to the hyphens would work for distinguishing types, yeah. |
@pirolen I checked in a change to FoLiA-abby, to insert the hyphens as a class in the I hope you have time to check this out. |
I am happy to test it, thanks! Since our project runs on a machine that does not have LaMachine (and as I understand it is deprecated so better not install it), I tried to install foliautils as a Docker container doing as the instructions say
(Ignore the rest, I used the wrong Dockerfile) |
That's the right procedure indeed. What problem did you run into with docker? |
fyi, I just tried the docker build and it worked okay. |
Yes, managed to install it, now, thanks :-) Now trying to understand how to run FoLiA-txt on a specific file :-) |
I guess it works! in interactive mode :-)) |
Shall I also install python-ucto as a container, or rather with pip? Does the container always need to run if I want to call it from a script? Can ucto now deal with the new |
python-ucto should just be installed via pip yes, ideally in a python virtualenv, no need for docker there. It will self-contain all dependencies nowadays. |
OK, cool. Can ucto now deal with the new |
I'm not sure actually, @kosloot will know better. Ucto hasn't been changed in any case. It might already 'ignore' the t-hbr element and produce proper tokens? |
|
Okay, I'll hold the release for a bit then, best to fix first indeed. I think she already was on the latest development foliautils anyway. |
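@pirolen You wanted the `¬` symbol to be removed from the text, so text fragments ending with it should be appended anyway. Reconstructing or preserving the original layout is quite a hassle. |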
Lines that don’t end with the soft break are actually word boundaries, but a corresponding space is missing after conversion.
…________________________________________
From: Ko van der Sloot ***@***.***>
Sent: Friday, January 27, 2023 1:32:29 PM
To: LanguageMachines/ucto
Cc: Lendvai, Piroska; Mention
Subject: Re: [LanguageMachines/ucto] Question: Concatenating word parts at soft hyphens (Issue #90)
@pirolen You wanted the ¬ symbol to be removed from the text, so text fragments ending with it should be appended anyway.
The idea behind FoLiA-txt is to create paragraphs, and allow ucto to do sentence detection on them. This works OK.
Reconstructing or preserving the original layout is quite a hassle.
There is also NO 1-1 relation in formatting of a text processed with first FoLiA-txt and then FoLiA-2text.
Maybe we can do a bit better, but removing the ¬ from the FoLiA is irreversible.
Unless we add a new option in libfolia AND FoLiAPY to interpret the class in <t-hbr class="¬"/>.
So we are still chasing our own tail a bit.
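The joining rule discussed here (append fragments that end in `¬`, treat every other line end as a word boundary that needs a space) can be sketched on plain text lines as follows. This is not how FoLiA-txt is implemented; it is only an illustration of the rule, and the function name `join_lines` and the sample lines are hypothetical. As noted above, dropping the `¬` this way is irreversible.

```python
from typing import Iterable, List

SOFT_HYPHEN_SIGN = "¬"


def join_lines(lines: Iterable[str]) -> str:
    """Join lines into one paragraph string.

    A line ending in the soft hyphen sign is glued to the next line
    without a space (and the sign is dropped); any other line end is
    treated as a word boundary and becomes a single space.
    """
    parts: List[str] = []
    glue_next = False  # True when the previous line ended in the sign
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped in this sketch
        if parts and not glue_next:
            parts.append(" ")
        if line.endswith(SOFT_HYPHEN_SIGN):
            parts.append(line[: -len(SOFT_HYPHEN_SIGN)])
            glue_next = True
        else:
            parts.append(line)
            glue_next = False
    return "".join(parts)


# Hypothetical input lines, for illustration only.
text = join_lines(["Der Land¬", "wirtschaft geht", "es gut."])
print(text)
```

The sketch deliberately ignores paragraph boundaries and the original layout, which is exactly the information that gets lost once the line breaks and soft hyphen signs are removed.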
|
A yes. My bad. Stupid. |
I gave it some thought, and it seems to be quite complex. But this is more a FoLiA issue than a ucto one. I will create a new issue there. |
Just a note that the validation error persists, with the same error msg, also after pulling the new ucto image and using another file. |
Do you mean cloning the repo, and then building from source? If I run ./bootstrap.sh: 64: automake: not found |
Hi, I still have the problems pointed out, (A) using the docker ucto, or a docker file I built using this repo.
I tried to build the CLI ucto from this repo, but
'LT_INIT' is indeed in 'configure.ac'. |
Those errors point to a number of build dependencies missing on your system, try But since you already tried the containers, including one you built from the git repo, I wonder if ucto solves your issue at all. I think Ko implemented most of the necessary changes in FoLiA-2text (part of foliautils), as mentioned in #90 (comment). |
Besides you missing a sound development environment (see comment above), I'm pretty confused about what you are trying to accomplish, and how.
So NOT running ucto on the text directly! If this still fails, then please give me the text (NOT a screenshot please) so I can look into it. AND:
Can you send me that file please, as this may be a real bug |
What I try to accomplish is pretty simple:
There is a workaround I found:
But I wonder if there is a simpler way to achieve it. The environment I use is on ubuntu 22, specifically for folia and its ecosystem, incl. containers. I do not typically compile software (in it), so it may miss tools and compilers that developers use. But this might hold for other end users too, and so far I did not miss these when installing/using other tools around folia (or elsewhere). I did run much thanks! |
First: The text file seems a bit mangled, or it contains strange character encodings, but nevertheless (DISCLAIMER: using the most recent versions of all software):
$ FoLiA-txt -O bug delepr_orig.txt
In this XML, you will find tokenized sentences without hyphens, like:
<s xml:id="delepr_orig.p.1.s.2">
  <w xml:id="delepr_orig.p.1.s.2.w.1" class="WORD">
    <t>е҆ппа́</t>
  </w>
  <w xml:id="delepr_orig.p.1.s.2.w.2" class="WORD">
    <t>филипьскаго</t>
  </w>
  <w xml:id="delepr_orig.p.1.s.2.w.3" class="PUNCTUATION">
    <t>.</t>
  </w>
</s>
Where a hyphen is removed from:
So no problems to see. What am I missing? |
I can't use the CLI ucto since so far I was unable to compile it. |
And the container ucto produces invalid folia. |
As far as I know all the fixes for this issue were already released a while back, so I tried this with the officially published containers (latest stable releases, not the latest git development, so you don't have to build any containers or compile anything yourself and can just pull them from Docker Hub), and it all works fine:
Also verified with
|
Yeah, if you use the containers you don't need any further build tools on the system itself. |
It's valid unicode, but there seem to be some codepoints in the private use area, so they probably render only very specifically with certain dedicated fonts. |
Thanks, it is indeed the case... |
Well, maybe @proycon can check that in his environment? Because my XML has:
so with
|
I have the same, yes. And at the sentence level they are gone (which is by design as I understand it):
<w xml:id="delepr_orig.p.1.s.2.w.2" class="WORD">
  <t>филипьскаго</t>
</w>
I think that is due to there being two text classes, the normal one (current), without the hyphens. and |
(I'm having trouble loading the file at all in FLAT currently, so there may be a bug in FLAT.) |
Thanks! |
I assume (hope) that this is settled now. |
Hi and Happy New Year!
I wonder if there is a way to transform FoLiA linebreaks (`<br/>`) into soft breaks with ucto, in case there is a soft hyphen sign `¬` at the end of the line in a FoLiA document. The goal is to access hyphenated word parts as single tokens.
I have untokenized data with lots of linebreaks as a txt file, which I converted to FoLiA using piereling, e.g. as below.
Or perhaps the piereling converter or `FoLiA-txt` could already handle the soft breaks accordingly?