Question: Concatenating word parts at soft hyphens #90

pirolen · 2023-01-09T21:40:41Z

Hi and Happy New Year!

I wonder if there is a way to transform FoLiA linebreaks (<br/>) into soft breaks with ucto, in case there is a soft hyphen sign ¬ at the end of the line in a FoLiA document. The goal is to access hyphenated word parts as single tokens.

I have untokenized data with lots of linebreaks as txt file, which I converted to FoLiA using piereling, e.g. as below.
Or perhaps the piereling converter resp. FoLiA-txt could already handle the soft breaks accordingly?

<p xml:id="TRAINING_VALIDATION_SET_Combined_VKS_2_Silvestrovskij_0_01_GT_softbreaks.text.p.1">
      <t>всѧцѣмь ѡбразомъ. аще<br/>виною аще ли истиною, хс҃<br/>проповѣдаємь єсть.<br/>и ѡ семь, раⷣуюсѧ, но и въ¬<br/>зрадоуюсѧ. вѣмь бѡ ꙗко<br/>се ми събоудетсѧ въ спс҃енїе.<br/>вашею мл҃твою и по данїю<br/>дх҃а їс҃ хв҃а. по чаанїю и оупо¬<br/>ванїю моємоу. тѡлкѡⷡ҇.<br/>Ѿ любве же рече соущаѧ о гд҃ѣ<br/>и ѡ мнѣ. вѣдоуще ꙗко въ<br/>ѿвѣтѣ лежоу блг҃овѣствⷪ¬<br/>ванїа. разоумѣша бѡ рече,<br/>ꙗко посланъ єсмь ѿ ба҃ пропо¬<br/>вѣдати єѵⷢ҇лїє. и ꙗко пра¬<br/>выню о семь дамъ. ѡбле¬<br/>гчать же ми ꙗже ѡ семь<br/>правынѧ. єже многы ѡгла¬<br/>сити, и тещи на проповѣдь.<br/>се оубѡ рече, видѣвше они.<br/>да ми ꙗже къ б҃гоу правынѧ<br/>ѡблежать, и ѡгласѧть.<br/>мнѡгомь словомь и пропо¬<br/>вѣдающе. что бѡ рече,<br/>длъго слово, ли что ми рече<br/>хощеть. кымь смотре¬<br/>нїємь кто проповѣдати.<br/>но да правѣ проповѣсть.<br/>и сиⷯ нѣцїи, несмысленїи и прї¬<br/>ꙗша. ꙗко всѣмь єресемъ<br/>преⷣпоутїє подаⷭ҇ апⷭ҇лъ. єже<br/>рещи аще виною, аще ли<br/>истиною възвѣщаємь<br/>бываєть. но ты въни¬<br/>маи. пръвоє оубѡ, не реч҇,<br/>да възвѣстисѧ. да се въ¬<br/>законити мниши. но про¬<br/>повѣдаєтсѧ, єже быва¬<br/>ємо исповѣдаємь. таче<br/>аще и възаконѧа рече. ибѡ<br/>мнѡзи єретици претвори¬<br/>ша писанїє. имоуще тако<br/>ха҃ да възвѣщенъ боудеⷮ.<br/>ниже тако поути єресемь<br/>предаⷭ҇. како; ꙗко ти оубѡ<br/></t>
    </p>


proycon · 2023-01-10T15:41:46Z

> Hi and Happy New Year!

Same to you!

> I wonder if there is a way to transform FoLiA linebreaks (`<br/>`) into soft breaks with ucto, in case there is a soft hyphen sign `¬` at the end of the line in a FoLiA document.

With ucto itself, no. You'd have to write a script (using foliapy ideally) that replaces instances of folia.Linebreak with [folia.Hyphbreak](https://folia.readthedocs.io/en/latest/hyphenation_annotation.html) whenever it sees a soft hyphen sign immediately preceding it. Such a script would be usable/generic enough to add to [foliatools](https://github.com/proycon/foliatools).

If you want the quick & easy way out (but sloppy, so not without risk), you could try something like:

```shell
$ sed -e 's|¬ |<t-hbr/>|g' < yourfile.folia.xml
```

You'd then also have to manually insert a declaration for the hyphenation annotation you inserted, otherwise parsers will complain:

```xml
<hyphenation-annotation>
  <annotator processor="p1" />
</hyphenation-annotation>
```

And in the provenance section:

```xml
<provenance>
  <processor xml:id="p1" name="pirolen" type="manual" />
</provenance>
```
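The quick replacement can also be sketched in Python with only the standard library. This is a minimal sketch, not the foliapy-based script suggested above: it works on the raw XML string, assumes the soft hyphen sign `¬` sits directly before `<br/>` (as in the sample paragraph), and the function name is hypothetical. The hyphenation declaration still has to be added separately.

```python
import re

# Assumption: the soft hyphen sign "¬" directly precedes the <br/> element,
# optionally separated by whitespace.
SOFT_HYPHEN_BREAK = re.compile(r"¬\s*<br\s*/>")

def linebreaks_to_hyphbreaks(folia_xml: str) -> str:
    """Replace every '¬<br/>' by '<t-hbr/>'; all other <br/> elements are kept."""
    return SOFT_HYPHEN_BREAK.sub("<t-hbr/>", folia_xml)

sample = "<t>и въ¬<br/>зрадоуюсѧ. вѣмь бѡ ꙗко<br/>се ми</t>"
print(linebreaks_to_hyphbreaks(sample))
```

Like the sed one-liner, this is string surgery on XML and carries the same risks; a script built on a FoLiA library would be more robust.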

> The goal is to access hyphenated word parts as single tokens.

I'm not sure what the current state of hyphenation support is in ucto and if they will be recognised as single tokens as you'd indeed expect, @kosloot probably knows better.

> `FoLiA-txt` could already handle the soft breaks accordingly?

It might output a non-spacing break character or strip it altogether? I don't remember well. Best to try and test to be sure.

proycon · 2023-01-10T20:40:19Z

Oops sorry, accidentally closed this probably a bit too prematurely

kosloot · 2023-01-12T11:04:10Z

I didn't have time to look into this a lot, but I think an important question is also:
how did the `¬` symbols end up in the FoLiA?
When possible, these symbols should already be handled as special when creating the FoLiA.
FoLiA-abby does so, and creates a <t-hbr> (or at least it should...).

Maybe this can be improved.

pirolen · 2023-01-12T11:36:53Z

This is also what I remembered (but could not look it up till now in detail), that FoLiA-abby is able to handle the soft hyphen accordingly, so that the soft-hyphenated words are treated as single tokens by ucto afterwards.

In the text example above, the soft hyphens were hacked in by me for testing & illustration (replacing the original normal hyphens), but the txt --> FoLiA converter could not interpret them the same way as FoLiA-abby does (i.e., it just put a <br/> at the end of the line anyway).

pirolen · 2023-01-12T11:40:53Z

Illustration of FoLiA-abby output (left), ucto-d afterwards (right):

kosloot · 2023-01-13T08:52:30Z

So FoLiA-abby does exactly what you want, and ucto can work with that.
GOOD

Your made-up example will indeed not be handled by FoLiA-2text.
I could maybe hack in some feature to handle this, but that implies a major change in libfolia
and might have as yet unforeseen ramifications.

SO: it would be better to NOT create this kind of FoLiA. Not by hand and not by any tool.

That leaves my question: you apparently have FoLiA files with soft hyphens. Which FoLiA tool created those?
I would prefer to modify those tools.

pirolen · 2023-01-13T09:56:56Z

Thank you very much for looking into this. I have plain text files, and used piereling for the quick FoLiA conversion, but would be happy to use FoLiA-txt instead.

kosloot · 2023-01-13T11:28:34Z

Ah, OK, it's getting clearer now.
So you have a text like:

Der Land¬
wirtschaft

And would like FoLiA-txt to output (fragment)

<p xml:id="hyp.p.1">
      <t class="FoLiA-txt">Der Landwirtschaft</t>
      <str xml:id="hyp.p.1.str.1">
        <t class="FoLiA-txt">Der</t>
      </str>
      <str xml:id="hyp.p.1.str.2">
        <t class="FoLiA-txt">Landwirtschaft</t>
      </str>
    </p>

This would be easy... The soft hyphen is discarded.

OR: do you want to somehow keep information about the soft hyphen in the FoLiA?
That might also be possible, but would lead to some very ugly discussions about how text extraction should work in FoLiA documents. Probably including an exception to totally ignore soft hyphens. But always? Sometimes? Nasty.

pirolen · 2023-01-13T12:11:56Z

Ideally, the line break information would be kept via the <t-hbr> tag in the untokenized FoLiA, and the converter would set this tag whenever it sees a soft linebreak symbol -- just like FoLiA-abby does, see my screenshot example above.

But I am also fine with the postprocessing script solution, as proycon suggested above.

kosloot · 2023-01-13T13:20:24Z

Ok,
one way to do so is:

    <p xml:id="hyp.p.1">
      <t class="FoLiA-txt">Der Landwirtschaft</t>
      <str xml:id="hyp.p.1.str.1">
        <t class="FoLiA-txt">Der</t>
      </str>
      <str xml:id="hyp.p.1.str.2">
        <t class="FoLiA-txt">Land<t-hbr>¬</t-hbr></t>
      </str>
      <str xml:id="hyp.p.1.str.3">
        <t class="FoLiA-txt">wirtschaft</t>
      </str>
    </p>

FoLiA-2text (and folia2txt) will extract the text:
Der Landwirtschaft
based on the text node on the paragraph.
The <str> nodes are ignored for text extraction.

This is probably OK, but it somehow feels 'odd' that the soft hyphen disappears in the <t> node.
But including it and having the <t> carry the text Der Land¬wirtschaft has very nasty consequences
for the FoLiA-2text and folia2txt tools. Both will at the moment just leave the soft hyphen in place, and removing it is not an easy task, with unclear consequences. @proycon input welcome!

As a side note: it would be possible to implement some 'soft-hyphen handling' in ucto, discarding them totally.
This might be a new option, a built-in rule, or part of the configuration using the [FILTER] rule (which already filters out the Unicode 00AD soft hyphen).

A lot to think about after the weekend

pirolen · 2023-01-13T13:25:57Z

Based on my screenshot, FoLiA-abby replaces ¬ with <t-hbr/>, doesn't it? (And ucto knows what to do with it.)
I.e., the 'hard hyphens' in the image of a text are interpreted and appear as ¬ in Abbyy's OCR XML.

kosloot · 2023-01-13T23:09:35Z

Yes, that's part of my point. I could teach FoLiA-txt the same trick, as suggested above.

<t-hbr>¬</t-hbr> is just a possibility to keep at least the information of the ¬ preserved. BUT SEE BELOW!

BUT: the discussion is about which text should be present in the <t> node:
<t>Der Land<t-hbr/>wirtschaft</t>
vs.
<t>Der Land<t-hbr>¬</t-hbr>wirtschaft</t>
vs.
Der Landwirtschaft

At the moment, FoLiA-2text and folia2txt have different opinions about how to handle the second variant.
So it would be safe to take the Abby Road :)
Needs discussion with @proycon. Probably as another text-extraction problem in FoLiA (there are many).
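To make the text-extraction question concrete, here is a minimal sketch of one possible policy: skip `<t-hbr>` and anything inside it. It uses only Python's standard library on un-namespaced fragments, so it is an illustration of the policy and not how libfolia or FoLiAPY actually implement extraction; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def extract_text(t_elem):
    """Concatenate the text of a <t> element, skipping <t-hbr> and its content."""
    parts = [t_elem.text or ""]
    for child in t_elem:
        if child.tag != "t-hbr":
            parts.append(extract_text(child))
        # Text following the child element belongs to the parent's running text.
        parts.append(child.tail or "")
    return "".join(parts)

variants = [
    "<t>Der Land<t-hbr/>wirtschaft</t>",
    "<t>Der Land<t-hbr>¬</t-hbr>wirtschaft</t>",
    "<t>Der Landwirtschaft</t>",
]
for xml in variants:
    print(extract_text(ET.fromstring(xml)))
```

Under this policy all three variants yield the same extracted text, which is the behaviour one would want if embedded text inside `<t-hbr>` is to be ignored.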

@proycon I discovered that the FoLiA docs here state:

> the hyphenised break is a softer break, only there for page formatting purposes. The hyphen symbol is by definition implied in its usage, so should never be explicitly incorporated in the text content.

That implies that my idea of including the ¬ is not desired. (I could use a class for this purpose.)
Fine. But then maybe it is better to explicitly forbid this, disallowing text content inside a <t-hbr>?
Both libfolia and FoLiAPY accept this construction, but FoLiAPY seems to ignore all embedded text, while libfolia preserves it. We should reach common ground here (presumably ignoring it).

kosloot · 2023-01-14T08:12:27Z

@proycon @pirolen: At the moment, FoLiA-abby has a --keephyphens option that includes the original hyphenation symbol in the FoLiA. Is this option really used? Because it should probably be removed or changed to use a class instead of text.
@pirolen What ramifications does removing/changing have for you?

pirolen · 2023-01-14T13:54:56Z

The presence of the hyphen needs to be kept traceable. If it is unambiguously replaced by the soft hyphen tag followed by the t-style tag, I guess that provides enough provenance information.

kosloot · 2023-01-16T12:30:33Z

> The presence of the hyphen needs to be kept traceable.

That's why I introduced <t-hbr>¬</t-hbr> in FoLiA-abby.
But we learned that that is 'semi-illegal', so I need to come up with a different solution.

As such, <t-hbr/> could be enough to signal a soft hyphen. But there are more hyphens in the world.
Introducing a separate tag for every hyphen is the wrong path.
So we should take the 'class' road then.
Using something like: <t-hbr class="soft"/>

> If it is unambiguously replaced by the soft hyphen tag followed by the t-style tag, ...

There is no direct relation between <t-hbr> and <t-style>.
That they are adjacent in the FoLiA-abby output is a coincidence, not some general FoLiA property.

To summarize:

I suggest modifying FoLiA-abby to generate <t-hbr class="soft"/> tags, and enhancing FoLiA-txt to do the same.
I suggest explicitly forbidding text content inside a <t-hbr> (see "text inside <t-hbr> nodes is allowed, but problematic", proycon/foliapy#25).
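As an illustration of the first suggestion, the following sketch builds a `<t>` element from plain-text lines, turning an end-of-line hyphenation symbol into a classed `<t-hbr>` and every other line end into `<br/>`. It is not the FoLiA-txt implementation; the function name, the set of symbols, and the choice to store the symbol itself as the class value are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: these are the end-of-line hyphenation symbols of interest.
# Treating a plain "-" this way can misfire on genuinely hyphenated words.
HYPHENS = ("¬", "-")

def lines_to_t(lines):
    """Build a <t> element; a trailing hyphen symbol becomes <t-hbr class=.../>,
    any other line end (except the last) becomes <br/>."""
    t = ET.Element("t")
    last = None  # most recently appended break element
    for i, line in enumerate(lines):
        line = line.rstrip()
        is_last = i == len(lines) - 1
        symbol = line[-1] if (not is_last and line.endswith(HYPHENS)) else None
        if symbol:
            line = line[:-1]  # the symbol moves into the class attribute
        # The line's text is either the element text or the tail of the previous break.
        if last is None:
            t.text = (t.text or "") + line
        else:
            last.tail = line
        if not is_last:
            if symbol:
                last = ET.SubElement(t, "t-hbr", {"class": symbol})
            else:
                last = ET.SubElement(t, "br")
    return t

print(ET.tostring(lines_to_t(["Der Land¬", "wirtschaft"]), encoding="unicode"))
```

Keeping the symbol in the class attribute leaves the `<t-hbr>` element itself empty, so no text content ends up inside it.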

kosloot · 2023-01-16T14:27:13Z

@proycon Just an idea: I could also create nodes like:
<t-hbr class="¬"/> or <t-hbr class="-"/>
Could that be problematic?
We could create an (open?) set of hyphenation symbols to choose from.

proycon · 2023-01-16T16:16:26Z

> That's why I introduced <t-hbr>¬</t-hbr> in FoLiA-abby.

Ah, I didn't realize you introduced it, I thought it was in the specification too, but indeed it isn't. In that case it may be easier to change the behaviour, as only FoLiA-abby output is affected (which probably only affects @pirolen?)

Assigning classes to the hyphens would work yeah for distinguishing types.

kosloot · 2023-01-17T13:33:08Z

@pirolen I checked in a change to FoLiA-abby to insert the hyphens as a class in the <t-hbr> tags.
I also modified FoLiA-txt to do the same trick, replacing end-of-line hyphens by <t-hbr> nodes.
Both are in foliautils in Git.

I hope you have time to check this out.

pirolen · 2023-01-17T21:04:46Z

I am happy to test it, thanks!

Since our project runs on a machine that does not have LaMachine (and as I understand it is deprecated so better not install it), I tried to install foliautils as a Docker container doing as the instructions say

docker build -t proycon/foliautils --build-arg VERSION=development .

(Ignore the rest, I used the wrong Dockerfile)

proycon · 2023-01-17T21:43:38Z

That's the right procedure indeed. What problem did you run into with docker?

proycon · 2023-01-17T21:48:38Z

fyi, I just tried the docker build and it worked okay.

pirolen · 2023-01-17T21:50:49Z

Yes, managed to install it, now, thanks :-) Now trying to understand how to run FoLiA-txt on a specific file :-)

pirolen · 2023-01-17T21:53:05Z

I guess it works! in interactive mode :-))

pirolen · 2023-01-17T21:57:47Z

Shall I also install python-ucto as a container, or rather with pip? Does the container always need to run if I want to call it from a script?

Can ucto now deal with the new <t-hbr class="¬"/>?

proycon · 2023-01-17T22:00:06Z

python-ucto should just be installed via pip yes, ideally in a python virtualenv, no need for docker there. It will self-contain all dependencies nowadays.

pirolen · 2023-01-17T22:00:56Z

OK, cool. Can ucto now deal with the new <t-hbr class="¬"/>?

proycon · 2023-01-17T22:05:26Z

I'm not sure actually, @kosloot will know better. Ucto hasn't been changed in any case. It might already 'ignore' the t-hbr element and produce proper tokens?

kosloot · 2023-01-27T12:04:14Z

> In fact, foliautils wasn't released yet, I think we should, right @kosloot?

Yes, I thought you already did that.
BUT: @pirolen wrote:

> Hmm, not fully... there is a space problem at the end of lines. FoLiA-txt concatenates lines without a space, if I see it well.

So that needs investigation and maybe a bugfix.

proycon · 2023-01-27T12:09:56Z

Okay, I'll hold the release for a bit then, best to fix first indeed. I think she already was on the latest development foliautils anyway.

kosloot · 2023-01-27T12:32:17Z

@pirolen You wanted the ¬ symbol to be removed from the text, so text fragments ending with that should be appended anyway.
The idea behind FoLiA-txt is to create paragraphs, and allow ucto to do sentence detection on them. This works OK.

Reconstructing or preserving the original layout is quite a hassle.
There is also NO 1-1 relation in formatting of a text processed first with FoLiA-txt and then with FoLiA-2text.
Maybe we can do a bit better, but removing the ¬ from the FoLiA is irreversible.
Unless we add a new option in libfolia AND FoLiAPY to interpret the class in <t-hbr class="¬"/>
So we are still chasing our own tail a bit.

pirolen · 2023-01-27T13:43:31Z

Lines that don’t end with the soft break are actually word boundaries, but a corresponding space is missing after conversion.
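The joining rule described here can be sketched as follows: a line ending in the soft hyphen sign is glued to the next line, while any other line end counts as a word boundary and gets a space. This is a minimal sketch under that assumption only, with a hypothetical function name; it is not the FoLiA-txt code.

```python
def join_lines(lines, soft_hyphen="¬"):
    """Join physical lines into running text."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines in this simplified sketch
        if line.endswith(soft_hyphen):
            # Hyphenated word: drop the sign and glue to the next line.
            out.append(line[:-len(soft_hyphen)])
        else:
            # Ordinary line end: it is a word boundary, so add a space.
            out.append(line + " ")
    return "".join(out).rstrip()

print(join_lines(["и ѡ семь, раⷣуюсѧ, но и въ¬", "зрадоуюсѧ. вѣмь бѡ ꙗко", "се ми"]))
```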


kosloot · 2023-01-28T09:24:32Z

Ah yes. My bad. Stupid.
Will look into that next week

kosloot · 2023-01-30T09:20:57Z

I gave it some thought, and it seems to be quite complex. But this is more a FoLiA issue than a ucto one. I will create a new issue there.

pirolen · 2023-03-15T16:32:08Z

validat

2. I took the resulting non-tokenized folia file, and tokenized it with the containerized ucto:

docker run -v /home/pirol/quanti/devel/foliatests:/data -t -i proycon/ucto -L rus --textclass FoLiA-txt <infile> <outfile>

Next, I wanted to print the tokens in the file using foliapy, but there seems to be a validation error on this outfile:

Just a note that the validation error persists, with the same error msg, also after pulling the new ucto image and using another file.

pirolen · 2023-03-15T16:57:47Z

Well, not using the dockers, but the GitHub versions, it seems to work for me

Do you mean cloning the repo, and then building from source?

If I run sh ./build-deps.sh I get

./bootstrap.sh: 64: automake: not found
./bootstrap.sh: 73: aclocal: not found
./bootstrap.sh: 86: autoreconf: not found

pirolen · 2023-09-13T21:33:56Z

Hi, I still have the problems pointed out, (A) using the docker ucto, or a docker file I built using this repo.

If I feed a .txt file to it, it does not handle hyphens as expected.
If I feed a folia.xml file to it, the file it generates produces a validation error (when running validation afterwards).

I tried to build the CLI ucto from this repo, but bash bootstrap.sh produces this:

aclocal: installing 'm4/pkg.m4' from '/usr/share/aclocal/pkg.m4'
configure.ac:6: installing './install-sh'
configure.ac:6: installing './missing'
src/Makefile.am:10: error: Libtool library used but 'LIBTOOL' is undefined
src/Makefile.am:10:   The usual way to define 'LIBTOOL' is to add 'LT_INIT'
src/Makefile.am:10:   to 'configure.ac' and run 'aclocal' and 'autoconf' again.
src/Makefile.am:10:   If 'LT_INIT' is in 'configure.ac', make sure
src/Makefile.am:10:   its definition is in aclocal's search path.
src/Makefile.am: installing './depcomp'
parallel-tests: installing './test-driver'
autoreconf: automake failed with exit status: 1

'LT_INIT' is indeed in 'configure.ac'.
What else can I do?

proycon · 2023-09-14T07:48:56Z

Those errors point to a number of build dependencies missing on your system, try sudo apt-get install libxml2-dev libicu-dev libexttextcat-dev autoconf autoconf-archive automake libtool gcc g++ pkg-config make.

But since you already tried the containers, including one you built from the git repo, I wonder if ucto solves your issue at all. I think Ko implemented most of the necessary changes in FoLiA-2text (part of foliautils), as mentioned in #90 (comment).

kosloot · 2023-09-14T08:01:57Z

Besides you missing a sound development environment (see the comment above):

I'm pretty confused about what you are trying to accomplish, and how.
But the right way to handle this is:

Provide a text with (soft) hyphens to FoLiA-txt
Run ucto on the resulting FoLiA

So NOT running ucto on the text directly!

If this still fails, then please give me the text (NOT a screenshot please) so I can look into it.

AND:

> if I feed a folia.xml file to it, the file it generates will produce a validation error (if running validation afterwards).

Can you send me that file please, as this may be a real bug

pirolen · 2023-09-14T08:21:55Z

What I try to accomplish is pretty simple:

from a plain text file that has hyphenated lines, create a tokenized folia where the hyphens disappeared and the hyphenated word parts got joined.
I want to read in this file in FLAT (thus: validation is important).

There is a workaround I found:

plain text v1 --> text2folia (in containerized foliautils) --> folia xml
folia xml --> folia2text (in containerized foliautils)-> plain text v2 (hyphens disappear, word parts get joined)
plain text v2 --> containerized ucto --> folia xml

But I wonder if there is a simpler way to achieve it.
Attached is plain text v1...

The environment I use is on ubuntu 22, specifically for folia and its ecosystem, incl. containers. I do not typically compile software (in it), so it may miss tools and compilers that developers use. But this might hold for other end users too, and so far I did not miss these when installing/using other tools around folia (or elsewhere).

I did run bash bootstrap.sh in the env before trying to build ucto from the cloned github repo... that also gave some errors.

much thanks!

delepr_orig.txt

kosloot · 2023-09-14T08:46:12Z

First: The text file seems a bit mangled, or it contains strange character encodings, but nevertheless:

(DISCLAIMER: USING the most recent versions of all software)

$ FoLiA-txt -O bug delepr_orig.txt
$ folialint --nooutput bug/delepr_orig.folia.xml
Validated successfully: bug/delepr_orig.folia.xml
$ ucto -Lrus --inputclass=FoLiA-txt bug/delepr_orig.folia.xml bla.xml
$ folialint --nooutput bla.xml
Validated successfully: bla.xml

In this XML you will find tokenized sentences without hyphens, like:

      <s xml:id="delepr_orig.p.1.s.2">
        <w xml:id="delepr_orig.p.1.s.2.w.1" class="WORD">
          <t>е҆ппа́</t>
        </w>
        <w xml:id="delepr_orig.p.1.s.2.w.2" class="WORD">
          <t>филипьскаго</t>
        </w>
        <w xml:id="delepr_orig.p.1.s.2.w.3" class="PUNCTUATION">
          <t>.</t>
        </w>
      </s>

Where a hyphen is removed from: филипьскаго

So no problems to see. What am I missing?

pirolen · 2023-09-14T08:57:46Z

> What Am I missing?

I can't use the CLI ucto since so far I was unable to compile it.

pirolen · 2023-09-14T08:58:32Z

And the container ucto produces invalid folia.

proycon · 2023-09-14T09:16:33Z

As far as I know, all the fixes for this issue were already released a while back, so I tried this with the officially published containers (latest stable releases, so not the latest git development). This means you don't have to build any containers or compile anything yourself; you can just pull them from Docker Hub. It all works fine:

$ docker run --rm -t -i -v .:/data proycon/foliautils FoLiA-txt -O bug delepr_orig.txt 
Processed: delepr_orig.txt into bug/delepr_orig.folia.xml still 0 files to go.
$ docker run --rm -t -i -v .:/data proycon/foliautils folialint --nooutput bug/delepr_orig.folia.xml
Validated successfully: bug/delepr_orig.folia.xml
$ docker run --rm -t -i -v .:/data proycon/ucto -Lrus --inputclass=FoLiA-txt bug/delepr_orig.folia.xml bla.xml
ucto: inputfile = bug/delepr_orig.folia.xml
ucto: outputfile = bla.xml
ucto: configured for languages: [rus]
$ docker run --rm -t -i -v .:/data proycon/foliautils folialint --nooutput bla.xml  
Validated successfully: bla.xml

Also verified with foliavalidator, since you want to load things in FLAT:

$ foliavalidator bla.xml 
Validated successfully: bla.xml

proycon · 2023-09-14T09:19:22Z

> The environment I use is on ubuntu 22, specifically for folia and its ecosystem, incl. containers. I do not typically compile software (in it), so it may miss tools and compilers that developers use.

Yeah, if you use the containers you don't need any further build tools on the system itself.

proycon · 2023-09-14T09:30:11Z

> The text file seems a bit mangled, or it contains strange character encodings

It's valid unicode, but there seem to be some codepoints in the private use area, so they probably render only very specifically with certain dedicated fonts.

pirolen · 2023-09-14T09:50:37Z

> It's valid unicode, but there seem to be some codepoints in the private use area, so they probably render only very specifically with certain dedicated fonts.

Thanks, it is indeed the case...
I suspect it is not possible to render them in folia?

pirolen · 2023-09-14T09:55:46Z

> Yeah, if you use the containers you don't need any further build tools on the system itself.

Sure, and I do use them, but the hyphenation is not tackled correctly by them.

> As far as I know all the fixes for this issue were already released a while back, so I tried this with the officially published containers (latest stable releases, so not latest git development. So you don't have to build any containers nor compile anything yourself and can just pull them from docker hub), and it all works fine:

The problem is, hyphenation stays unresolved in the xml produced by the container, see screenshot (and ignore the not-rendered unicode).

Furthermore, sentences are not recognized from this file by FLAT.

If I produce a file based on my workaround above, both hyphenation and sentences are rendering correctly in FLAT.

kosloot · 2023-09-14T11:07:12Z

Well, maybe @proycon can check that in his environment? Because my XML has:

    <p xml:id="delepr_orig.p.1">
      <t class="FoLiA-txt">‌189r<br/>ст҃го́ меѳодиꙗ . е҆ппа́ фили<t-hbr>-</t-hbr>пьс
каго . къ и҆стелїю ѡ҆ прокаженїи ჻<br/>Ѿкѫдѹ , ѡ҇҆ є҆вⸯвѹлїе́ . не ꙗ҆вѣ ли ꙗ҆ко<br/>ѿ пѹ

so with <t-hbr>-</t-hbr> which is NOT used when using ucto or FoLiA-2text:

$ FoLiA-2text bla.xml
Processed :bla.xml into bla.xml.txt still 0 files to go.
$ more bla.xml.txt
189r ст҃го́ меѳодиꙗ . е҆ппа́ филипьскаго . къ и҆стелїю ѡ҆ прокаженїи ჻ Ѿкѫдѹ , ѡ҇҆ є҆вⸯв
.

proycon · 2023-09-14T11:51:11Z

> Well, maybe @proycon can check that in his environment?

I have the same yes, <t-hbr>-</t-hbr> at that point in the sentence.

And at the sentence level they are gone (which is by design as I understand it):

<w xml:id="delepr_orig.p.1.s.2.w.2" class="WORD">
    <t>филипьскаго</t>
</w>

> Furthermore, sentences are not recognized from this file by FLAT.

I think that is due to there being two text classes: the normal one (current), without the hyphens, and FoLiA-txt, the original one with the hyphens. If you still see the hyphens, then it's rendering the latter, but you want the former.
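The difference between the two text classes can be inspected with a few lines of Python. This is a hedged sketch using only the standard library on a hand-made fragment modelled on the output shown in this thread; the helper name is hypothetical, and the naive `itertext()` extraction deliberately keeps the text embedded in `<t-hbr>` so the hyphen stays visible in the `FoLiA-txt` class.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}  # the FoLiA XML namespace

def text_in_class(elem, textclass="current"):
    """Return the text of the direct <t> child of `elem` in the given class.
    A <t> without a class attribute belongs to the default class 'current'."""
    for t in elem.findall("f:t", NS):
        if t.get("class", "current") == textclass:
            return "".join(t.itertext())
    return None

# Hand-made fragment for illustration, not real tool output.
sample = """<p xmlns="http://ilk.uvt.nl/folia">
  <t class="FoLiA-txt">фили<t-hbr>-</t-hbr>пьскаго .</t>
  <s><w><t>филипьскаго</t></w><w><t>.</t></w></s>
</p>"""

p = ET.fromstring(sample)
print(text_in_class(p, "FoLiA-txt"))                         # paragraph text, original class
print([text_in_class(w) for w in p.iterfind(".//f:w", NS)])  # tokens, default class
```

A viewer that renders the `FoLiA-txt` class shows the untokenized paragraph text, while the tokens produced by ucto live in the default class.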

proycon · 2023-09-14T11:52:57Z

(I'm having trouble loading the file at all in FLAT currently, so there may be a bug in FLAT.)

proycon · 2023-09-14T11:58:58Z

Whilst "full perspective" failed for me, switching to "sentence" perspective did work:

In any case, let's open any flat issues in the flat issue tracker if they arise.

pirolen · 2023-09-14T13:08:32Z

> I think that is due to there being two text classes, the normal one (current), without the hyphens. and FoLiA-txt, the original one with the hyphens. If you still see the hyphens, then it's rendering the latter, but you want the former.

Thanks!
Also, I overlooked one thing of the docker CLI, that one needs to provide input and output filename. Thanks for the CLI examples.

kosloot · 2023-09-15T13:32:13Z

I assume (hope) that this is settled now.
I will investigate the possibility to include the same hyphenation policy as an option in ucto.
Not sure how complex that would be.

proycon self-assigned this Jan 10, 2023

proycon closed this as completed Jan 10, 2023

proycon reopened this Jan 10, 2023

kosloot mentioned this issue Jan 30, 2023

Handling of different types of hypens in text. LanguageMachines/foliautils#67

Closed

pirolen mentioned this issue Sep 14, 2023

Loading ucto file: Gateway Time-out proycon/flat#187

Closed

kosloot mentioned this issue Sep 15, 2023

Implement (soft)hyphen handling in Ucto analogues to foliautils #92

Open

kosloot closed this as completed Mar 19, 2024
