Question: Concatenating word parts at soft hyphens #90

Closed
pirolen opened this issue Jan 9, 2023 · 77 comments

@pirolen

pirolen commented Jan 9, 2023

Hi and Happy New Year!

I wonder if there is a way to have ucto treat FoLiA line breaks (<br/>) as soft breaks when a line in a FoLiA document ends with the soft hyphen sign ¬. The goal is to access hyphenated word parts as single tokens.

I have untokenized data with lots of line breaks as a txt file, which I converted to FoLiA using piereling, e.g. as below.
Or perhaps the piereling converter or FoLiA-txt could already handle the soft breaks accordingly?

<p xml:id="TRAINING_VALIDATION_SET_Combined_VKS_2_Silvestrovskij_0_01_GT_softbreaks.text.p.1">
      <t>всѧцѣмь ѡбразомъ. аще<br/>виною аще ли истиною, хс҃<br/>проповѣдаємь єсть.<br/>и ѡ семь, раⷣуюсѧ, но и въ¬<br/>зрадоуюсѧ. вѣмь бѡ ꙗко<br/>се ми събоудетсѧ въ спс҃енїе.<br/>вашею мл҃твою и по данїю<br/>дх҃а їс҃ хв҃а. по чаанїю и оупо¬<br/>ванїю моємоу. тѡлкѡⷡ҇.<br/>Ѿ любве же рече соущаѧ о гд҃ѣ<br/>и ѡ мнѣ. вѣдоуще ꙗко въ<br/>ѿвѣтѣ лежоу блг҃овѣствⷪ¬<br/>ванїа. разоумѣша бѡ рече,<br/>ꙗко посланъ єсмь ѿ ба҃ пропо¬<br/>вѣдати єѵⷢ҇лїє. и ꙗко пра¬<br/>выню о семь дамъ. ѡбле¬<br/>гчать же ми ꙗже ѡ семь<br/>правынѧ. єже многы ѡгла¬<br/>сити, и тещи на проповѣдь.<br/>се оубѡ рече, видѣвше они.<br/>да ми ꙗже къ б҃гоу правынѧ<br/>ѡблежать, и ѡгласѧть.<br/>мнѡгомь словомь и пропо¬<br/>вѣдающе. что бѡ рече,<br/>длъго слово, ли что ми рече<br/>хощеть. кымь смотре¬<br/>нїємь кто проповѣдати.<br/>но да правѣ проповѣсть.<br/>и сиⷯ нѣцїи, несмысленїи и прї¬<br/>ꙗша. ꙗко всѣмь єресемъ<br/>преⷣпоутїє подаⷭ҇ апⷭ҇лъ. єже<br/>рещи аще виною, аще ли<br/>истиною възвѣщаємь<br/>бываєть. но ты въни¬<br/>маи. пръвоє оубѡ, не реч҇,<br/>да възвѣстисѧ. да се въ¬<br/>законити мниши. но про¬<br/>повѣдаєтсѧ, єже быва¬<br/>ємо исповѣдаємь. таче<br/>аще и възаконѧа рече. ибѡ<br/>мнѡзи єретици претвори¬<br/>ша писанїє. имоуще тако<br/>ха҃ да възвѣщенъ боудеⷮ.<br/>ниже тако поути єресемь<br/>предаⷭ҇. како; ꙗко ти оубѡ<br/></t>
    </p>
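
For readers who only need the joined words, the goal stated above can also be approximated on the plain text before any FoLiA conversion. The following is a minimal sketch, not part of ucto or foliautils; the function name `join_soft_hyphenated_lines` is hypothetical, and it assumes that `¬` only ever appears as an end-of-line marker.

```python
SOFT_HYPHEN_MARK = "¬"  # end-of-line marker used in the example above


def join_soft_hyphenated_lines(text: str, mark: str = SOFT_HYPHEN_MARK) -> list[str]:
    """Merge each line ending in `mark` with the line that follows it."""
    joined: list[str] = []
    carry = ""  # word part still waiting for its continuation
    for raw in text.splitlines():
        line = carry + raw.strip()
        if line.endswith(mark):
            # Drop the marker and keep the text for the next line.
            carry = line[: -len(mark)]
        else:
            joined.append(line)
            carry = ""
    if carry:
        # Input ended on a marked line; keep what we have.
        joined.append(carry)
    return joined


if __name__ == "__main__":
    sample = "Der Land¬\nwirtschaft\nund so"
    for line in join_soft_hyphenated_lines(sample):
        print(line)
```

Note that this changes the line layout and discards the marker, so the original hyphenation is no longer traceable afterwards.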
@proycon proycon self-assigned this Jan 10, 2023
@proycon
Member

proycon commented Jan 10, 2023 via email

@proycon proycon closed this as completed Jan 10, 2023
@proycon
Member

proycon commented Jan 10, 2023

Oops, sorry, I accidentally closed this a bit prematurely.

@proycon proycon reopened this Jan 10, 2023
@kosloot
Contributor

kosloot commented Jan 12, 2023

I didn't have much time to look into this, but I think an important question is also:
how did the `¬` symbols end up as <br> in the FoLiA?
When possible, these symbols should already be handled as special when creating the FoLiA.
FoLiA-abby does so, and creates a <t-hbr> (or at least it should...).

Maybe this can be improved.

@pirolen
Author

pirolen commented Jan 12, 2023

This is also what I remembered (but could not look it up till now in detail), that FoLiA-abby is able to handle the soft hyphen accordingly, so that the soft-hyphenated words are treated as single tokens by ucto afterwards.

In the text example above, I hacked in the soft hyphens for testing and illustration (replacing the original normal hyphens), but the txt --> FoLiA converter could not interpret them the way FoLiA-abby does (i.e., it just put a <br> at the end of each line anyway).

@pirolen
Author

pirolen commented Jan 12, 2023

Illustration of FoLiA-abby output (left), ucto-d afterwards (right):

Screen Shot 2023-01-12 at 12 38 35

@kosloot
Contributor

kosloot commented Jan 13, 2023

So FoLiA-abby does exactly what you want, and ucto can work with that.
Good.

Your made-up example will indeed not be handled by FoLiA-2text.
I could maybe hack in some feature to handle this, but that implies a major change in libfolia
and might have as yet unforeseen ramifications.

So: it would be better NOT to create this kind of FoLiA, neither by hand nor with any tool.

That leaves my question: you apparently have FoLiA files with soft hyphens. Which FoLiA tool created those?
I would prefer to modify those tools.

@pirolen
Author

pirolen commented Jan 13, 2023

Thank you very much for looking into this. I have plain text files, and used piereling for the quick FoLiA conversion, but would be happy to use FoLiA-txt instead.

@kosloot
Contributor

kosloot commented Jan 13, 2023

Ah, OK, it's getting clearer now.
So you have a text like:

Der Land¬
wirtschaft

And would like FoLiA-txt to output (fragment)

<p xml:id="hyp.p.1">
      <t class="FoLiA-txt">Der Landwirtschaft</t>
      <str xml:id="hyp.p.1.str.1">
        <t class="FoLiA-txt">Der</t>
      </str>
      <str xml:id="hyp.p.1.str.2">
        <t class="FoLiA-txt">Landwirtschaft</t>
      </str>
    </p>

This would be easy... The soft hyphen is discarded.

Or do you want to somehow keep information about the soft hyphen in the FoLiA?
That might also be possible, but it would lead to some very ugly discussions about how text extraction should work in FoLiA documents, probably including an exception to totally ignore soft hyphens. But always? Sometimes? Nasty.

@pirolen
Author

pirolen commented Jan 13, 2023

Ideally, the line break information would be kept via the <t-hbr> tag in the untokenized FoLiA, and the converter would set this tag whenever it sees a soft linebreak symbol -- just like FoLiA-abby does, see my screenshot example above.

But I am also fine with the postprocessing script solution, as proycon suggested above.

@kosloot
Contributor

kosloot commented Jan 13, 2023

Ok,
one way to do so is:

    <p xml:id="hyp.p.1">
      <t class="FoLiA-txt">Der Landwirtschaft</t>
      <str xml:id="hyp.p.1.str.1">
        <t class="FoLiA-txt">Der</t>
      </str>
      <str xml:id="hyp.p.1.str.2">
        <t class="FoLiA-txt">Land<t-hbr>¬</t-hbr></t>
      </str>
      <str xml:id="hyp.p.1.str.3">
        <t class="FoLiA-txt">wirtschaft</t>
      </str>
    </p>

FoLiA-2text (and folia2txt) will extract the text:
Der Landwirtschaft
based on the text node on the paragraph.
The <str> nodes are ignored for text extraction.

This is probably OK, but it somehow feels 'odd' that the soft hyphen disappears in the <p> node.
But including it and having the <p> carry the text Der Land¬wirtschaft has very nasty consequences
for the FoLiA-2text and folia2txt tools. Both will at the moment just leave the soft hyphen in place, and removing it is not an easy task, with unclear consequences. @proycon input welcome!

As a side note: it would be possible to implement some 'soft-hyphen handling' in ucto, discarding them totally.
That might be a new option, a built-in rule, or a [FILTER] rule in the configuration (which already filters out the Unicode U+00AD soft hyphen).

A lot to think about after the weekend

@pirolen
Author

pirolen commented Jan 13, 2023

Based on my screenshot, FoLiA-abby replaces ¬ with </t-hbr>, doesn't it? (And ucto knows what to do with it.)
I.e., the 'hard hyphens' in the image of a text are interpreted and appear as ¬ in Abbyy's OCR XML.

@kosloot
Contributor

kosloot commented Jan 13, 2023

Yes, that's part of my point. I could teach FoLiA-txt the same trick, as suggested above.

<t-hbr>¬</t-hbr> is just a possibility to keep at least the information of the ¬ preserved. BUT SEE BELOW!

BUT: the discussion is about which text should be present in the <p> node.
Der Land<t-hbr/>wirtschaft</t>
vs.
Der Land<t-hbr>¬</t-hbr>wirtschaft</t>
vs
Der Landwirtschaft

At the moment, FoLiA-2text and folia2txt have different opinions about how to handle the second variant.
So it would be safe to take the Abby Road :)
This needs discussion with @proycon, probably as another text-extraction problem in FoLiA (there are many).

@proycon I discovered that the FoLiA docs state:

> the hyphenised break is a softer break, only there for page formatting purposes. The hyphen symbol is by definition implied in its usage, so should never be explicitly incorporated in the text content.

That implies that my idea of including the ¬ is not desired. (I could use a class for this purpose.)
Fine. But then maybe it is better to explicitly forbid this, disallowing text content inside a <t-hbr>?
Both libfolia and FoLiAPY accept this construction, but FoLiAPY seems to ignore all embedded text, while libfolia preserves it. We should reach common ground here (presumably ignoring it).
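
The three candidate encodings and the two extraction policies mentioned in this comment can be compared with a small sketch. This is not how libfolia or FoLiAPY implement text extraction; `extract_text` and its `keep_hbr_text` flag are hypothetical names, and the FoLiA XML namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def extract_text(t_elem: ET.Element, keep_hbr_text: bool = False) -> str:
    """Collect the text of a <t> element, with a switch for <t-hbr> content."""
    parts = [t_elem.text or ""]
    for child in t_elem:
        if child.tag == "t-hbr":
            # Policy switch: ignore or preserve text embedded in <t-hbr>.
            if keep_hbr_text:
                parts.append(child.text or "")
        elif child.tag == "br":
            # A hard line break separates words, so emit a space.
            parts.append(" ")
        else:
            parts.append(extract_text(child, keep_hbr_text))
        # Text following the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    variants = [
        "<t>Der Land<t-hbr/>wirtschaft</t>",
        "<t>Der Land<t-hbr>¬</t-hbr>wirtschaft</t>",
        "<t>Der Landwirtschaft</t>",
    ]
    for xml in variants:
        t = ET.fromstring(xml)
        print(extract_text(t), "|", extract_text(t, keep_hbr_text=True))
```

Only the second variant yields different text under the two policies, which is the disagreement described in this comment.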

@kosloot
Contributor

kosloot commented Jan 14, 2023

@proycon @pirolen: At the moment, FoLiA-abby has a --keephyphens option that includes the original hyphenation symbol in the FoLiA. Is this option really used? Because it should probably be removed or changed to use a class instead of text.
@pirolen What ramifications would removing or changing it have for you?

@pirolen
Author

pirolen commented Jan 14, 2023

The presence of the hyphen needs to be kept traceable. If it is unambiguously replaced by the soft hyphen tag followed by the t-style tag, I guess that provides enough provenance information.

@kosloot
Contributor

kosloot commented Jan 16, 2023

> The presence of the hyphen needs to be kept traceable.

That's why I introduced <t-hbr>¬</t-hbr> in FoLiA-abby.
But we learned that that is 'semi-illegal', so I need to come up with a different solution.

As such, <t-hbr/> could be enough to signal a soft hyphen. But there are more hyphens in the world.
Introducing a separate tag for every hyphen is the wrong path.
So we should take the 'class' road then,
using something like: <t-hbr class="soft"/>

> If it is unambiguously replaced by the soft hyphen tag followed by the t-style tag, ...

There is no direct relation between <t-hbr> and <t-style>.
That they are adjacent in the FoLiA-abby output is a coincidence, not some general FoLiA property.

To summarize:

  1. I suggest modifying FoLiA-abby to generate <t-hbr class="soft"/> tags, and enhancing FoLiA-txt to do the same.
  2. I suggest explicitly forbidding text content inside a <t-hbr> (see proycon/foliapy#25, "text inside <t-hbr> nodes is allowed, but problematic").
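
As an illustration of the first suggestion, here is a minimal sketch of turning end-of-line hyphenation symbols into class-carrying, text-free <t-hbr/> elements. It is not the FoLiA-txt implementation; the `lines_to_t` function and the `HYPHEN_CLASSES` mapping are hypothetical, and the FoLiA namespace and xml:id attributes are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from end-of-line hyphenation symbol to class name.
HYPHEN_CLASSES = {"¬": "soft"}


def lines_to_t(lines: list[str]) -> ET.Element:
    """Build a <t> element, joining lines with <t-hbr class=.../> or <br/>."""
    t = ET.Element("t")
    t.text = ""
    last = None  # most recently added child; following text goes in its tail
    for i, raw in enumerate(lines):
        line = raw.strip()
        is_last = i == len(lines) - 1
        # Only a line-final symbol counts, and only if another line follows.
        symbol = line[-1:] if (not is_last and line[-1:] in HYPHEN_CLASSES) else ""
        body = line[: -len(symbol)] if symbol else line
        if last is None:
            t.text += body
        else:
            last.tail = body
        if is_last:
            break
        if symbol:
            # The symbol itself is dropped; only its class is recorded.
            last = ET.SubElement(t, "t-hbr", {"class": HYPHEN_CLASSES[symbol]})
        else:
            last = ET.SubElement(t, "br")
    return t


if __name__ == "__main__":
    t = lines_to_t(["Der Land¬", "wirtschaft ist", "wichtig"])
    print(ET.tostring(t, encoding="unicode"))
```

Keeping the symbol in a class attribute rather than in element text preserves provenance without putting the hyphen into the text content, which is the point of the second suggestion.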

@kosloot
Contributor

kosloot commented Jan 16, 2023

@proycon Just an idea: I could also create nodes like:
<t-hbr class="¬"/> or <t-hbr class="-"/>
Could that be problematic?
We could create an (open?) set of hyphenation symbols to choose from.

@proycon
Member

proycon commented Jan 16, 2023

> That's why I introduced <t-hbr>¬</t-hbr> in FoLiA-abby.

Ah, I didn't realize you introduced it; I thought it was in the specification too, but indeed it isn't. In that case it may be easier to change the behaviour, as only FoLiA-abby output is affected (which probably only affects @pirolen?).

Assigning classes to the hyphens would work yeah for distinguishing types.

@kosloot
Contributor

kosloot commented Jan 17, 2023

@pirolen I checked in a change to FoLiA-abby to insert the hyphens as a class in the <t-hbr> tags.
I also modified FoLiA-txt to do the same trick, replacing end-of-line hyphens with <t-hbr> nodes.
Both are in foliautils in Git.

I hope you have time to check this out.

@pirolen
Author

pirolen commented Jan 17, 2023

I am happy to test it, thanks!

Since our project runs on a machine that does not have LaMachine (which, as I understand, is deprecated, so better not to install it), I tried to install foliautils as a Docker container, following the instructions:

docker build -t proycon/foliautils --build-arg VERSION=development .

(Ignore the rest, I used the wrong Dockerfile)

@proycon
Member

proycon commented Jan 17, 2023

That's the right procedure indeed. What problem did you run into with docker?

@proycon
Member

proycon commented Jan 17, 2023

fyi, I just tried the docker build and it worked okay.

@pirolen
Author

pirolen commented Jan 17, 2023

Yes, managed to install it, now, thanks :-) Now trying to understand how to run FoLiA-txt on a specific file :-)

@pirolen
Author

pirolen commented Jan 17, 2023

I guess it works! in interactive mode :-))

@pirolen
Author

pirolen commented Jan 17, 2023

Shall I also install python-ucto as a container, or rather with pip? Does the container always need to run if I want to call it from a script?

Can ucto now deal with the new <t-hbr class="¬"/>?

@proycon
Member

proycon commented Jan 17, 2023

python-ucto should just be installed via pip yes, ideally in a python virtualenv, no need for docker there. It will self-contain all dependencies nowadays.

@pirolen
Author

pirolen commented Jan 17, 2023

OK, cool. Can ucto now deal with the new <t-hbr class="¬"/>?

@proycon
Member

proycon commented Jan 17, 2023

I'm not sure actually, @kosloot will know better. Ucto hasn't been changed in any case. It might already 'ignore' the t-hbr element and produce proper tokens?

@kosloot
Contributor

kosloot commented Jan 27, 2023

> In fact, foliautils wasn't released yet, I think we should, right @kosloot?

Yes, I thought you already did that.
BUT: @pirolen wrote:

> Hmm, not fully... there is a space problem at the end of lines. FoLiA-txt concatenates lines without a space, if I see it well.

So that needs investigation and maybe a bugfix.

@proycon
Member

proycon commented Jan 27, 2023

Okay, I'll hold the release for a bit then, best to fix first indeed. I think she already was on the latest development foliautils anyway.

@kosloot
Contributor

kosloot commented Jan 27, 2023

@pirolen You wanted the ¬ symbol to be removed from the text, so text fragments ending with it should be appended anyway.
The idea behind FoLiA-txt is to create paragraphs and allow ucto to do sentence detection on them. This works OK.

Reconstructing or preserving the original layout is quite a hassle.
There is also NO 1-1 relation in the formatting of a text processed first with FoLiA-txt and then with FoLiA-2text.
Maybe we can do a bit better, but removing the ¬ from the FoLiA is irreversible,
unless we add a new option in libfolia AND FoLiAPY to interpret the class in <t-hbr class="¬"/>.
So we are still chasing our own tail a bit.

@pirolen
Author

pirolen commented Jan 27, 2023 via email

@kosloot
Contributor

kosloot commented Jan 28, 2023

Ah yes. My bad. Stupid.
Will look into that next week.

@kosloot
Contributor

kosloot commented Jan 30, 2023

I gave it some thought, and it seems to be quite complex. But this is more a FoLiA issue than a ucto one. I will create a new issue there.

@pirolen
Author

pirolen commented Mar 15, 2023

validat

2. I took the resulting non-tokenized folia file, and tokenized it with the containerized ucto:

docker run -v /home/pirol/quanti/devel/foliatests:/data -t -i proycon/ucto -L rus --textclass FoLiA-txt <infile> <outfile>

Next, I wanted to print the tokens in the file using foliapy, but there seems to be a validation error on this outfile:

Just a note that the validation error persists, with the same error msg, also after pulling the new ucto image and using another file.

@pirolen
Author

pirolen commented Mar 15, 2023

> Well, not using the dockers, but the GitHub versions, it seems to work for me

Do you mean cloning the repo, and then building from source?

If I run sh ./build-deps.sh I get

./bootstrap.sh: 64: automake: not found
./bootstrap.sh: 73: aclocal: not found
./bootstrap.sh: 86: autoreconf: not found

@pirolen
Author

pirolen commented Sep 13, 2023

Hi, I still have the problems pointed out, (A) using the docker ucto, or a docker file I built using this repo.

  • if I feed a .txt format file to it, it will not handle hyphens as supposed
  • if I feed a folia.xml file to it, the file it generates will produce a validation error (if running validation afterwards).

I tried to build the CLI ucto from this repo, but bash bootstrap.sh produces this:

aclocal: installing 'm4/pkg.m4' from '/usr/share/aclocal/pkg.m4'
configure.ac:6: installing './install-sh'
configure.ac:6: installing './missing'
src/Makefile.am:10: error: Libtool library used but 'LIBTOOL' is undefined
src/Makefile.am:10:   The usual way to define 'LIBTOOL' is to add 'LT_INIT'
src/Makefile.am:10:   to 'configure.ac' and run 'aclocal' and 'autoconf' again.
src/Makefile.am:10:   If 'LT_INIT' is in 'configure.ac', make sure
src/Makefile.am:10:   its definition is in aclocal's search path.
src/Makefile.am: installing './depcomp'
parallel-tests: installing './test-driver'
autoreconf: automake failed with exit status: 1

'LT_INIT' is indeed in 'configure.ac'.
What else can I do?

@proycon
Member

proycon commented Sep 14, 2023

Those errors point to a number of build dependencies missing on your system, try sudo apt-get install libxml2-dev libicu-dev libexttextcat-dev autoconf autoconf-archive automake libtool gcc g++ pkg-config make.

But since you already tried the containers, including one you built from the git repo, I wonder if ucto solves your issue at all. I think Ko implemented most of the necessary changes in FoLiA-2text (part of foliautils), as mentioned in #90 (comment).

@kosloot
Contributor

kosloot commented Sep 14, 2023

Besides you missing a sound development environment (see the comment above):

I'm pretty confused about what you are trying to accomplish, and how.
But the right way to handle this is:

  1. Provide a text with (soft) hyphens to FoLiA-txt
  2. Run ucto on the resulting FoLiA

So NOT running ucto on the text directly!

If this still fails, then please give me the text (NOT a screenshot, please) so I can look into it.

AND:

> if I feed a folia.xml file to it, the file it generates will produce a validation error (if running validation afterwards).

Can you send me that file, please, as this may be a real bug?

@pirolen
Author

pirolen commented Sep 14, 2023

What I try to accomplish is pretty simple:

  • from a plain text file that has hyphenated lines, create a tokenized folia where the hyphens disappeared and the hyphenated word parts got joined.
  • I want to read in this file in FLAT (thus: validation is important).

There is a workaround I found:

  • plain text v1 --> text2folia (in containerized foliautils) --> folia xml
  • folia xml --> folia2text (in containerized foliautils) --> plain text v2 (hyphens disappear, word parts get joined)
  • plain text v2 --> containerized ucto --> folia xml

But I wonder if there is a simpler way to achieve it.
Attached is plain text v1...

The environment I use is on ubuntu 22, specifically for folia and its ecosystem, incl. containers. I do not typically compile software (in it), so it may miss tools and compilers that developers use. But this might hold for other end users too, and so far I did not miss these when installing/using other tools around folia (or elsewhere).

I did run bash bootstrap.sh in the env before trying to build ucto from the cloned github repo... that also gave some errors.

much thanks!

delepr_orig.txt

@kosloot
Contributor

kosloot commented Sep 14, 2023

First: The text file seems a bit mangled, or it contains strange character encodings, but nevertheless:

(DISCLAIMER: USING the most recent versions of all software)

$ FoLiA-txt -O bug delepr_orig.txt
$ folialint --nooutput bug/delepr_orig.folia.xml
Validated successfully: bug/delepr_orig.folia.xml
$ ucto -Lrus --inputclass=FoLiA-txt bug/delepr_orig.folia.xml bla.xml
$ folialint --nooutput bla.xml
Validated successfully: bla.xml

In this XML, you will find tokenized sentences without hyphens, like:

      <s xml:id="delepr_orig.p.1.s.2">
        <w xml:id="delepr_orig.p.1.s.2.w.1" class="WORD">
          <t>е҆ппа́</t>
        </w>
        <w xml:id="delepr_orig.p.1.s.2.w.2" class="WORD">
          <t>филипьскаго</t>
        </w>
        <w xml:id="delepr_orig.p.1.s.2.w.3" class="PUNCTUATION">
          <t>.</t>
        </w>
      </s>

Where a hyphen is removed from: филипьскаго

So no problems to see. What am I missing?

@pirolen
Author

pirolen commented Sep 14, 2023

> What am I missing?

I can't use the CLI ucto since so far I was unable to compile it.

@pirolen
Author

pirolen commented Sep 14, 2023

And the container ucto produces invalid folia.

@proycon
Member

proycon commented Sep 14, 2023

As far as I know, all the fixes for this issue were already released a while back, so I tried this with the officially published containers (latest stable releases, not the latest git development version). That way you don't have to build any containers or compile anything yourself and can just pull them from Docker Hub. It all works fine:

$ docker run --rm -t -i -v .:/data proycon/foliautils FoLiA-txt -O bug delepr_orig.txt 
Processed: delepr_orig.txt into bug/delepr_orig.folia.xml still 0 files to go.
$ docker run --rm -t -i -v .:/data proycon/foliautils folialint --nooutput bug/delepr_orig.folia.xml
Validated successfully: bug/delepr_orig.folia.xml
$ docker run --rm -t -i -v .:/data proycon/ucto -Lrus --inputclass=FoLiA-txt bug/delepr_orig.folia.xml bla.xml
ucto: inputfile = bug/delepr_orig.folia.xml
ucto: outputfile = bla.xml
ucto: configured for languages: [rus]
$ docker run --rm -t -i -v .:/data proycon/foliautils folialint --nooutput bla.xml  
Validated successfully: bla.xml

Also verified with foliavalidator, since you want to load things in FLAT:

$ foliavalidator bla.xml 
Validated successfully: bla.xml
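
Since an earlier question in this thread was how to call the containers from a script, the four steps above can be assembled programmatically. This is only a sketch: the helper name `pipeline_commands` is hypothetical, the image names, flags and file names are copied from the session above, and the commands are printed rather than executed.

```python
import shlex


def pipeline_commands(txt_file: str, out_dir: str = "bug",
                      lang: str = "rus", tokenized: str = "bla.xml") -> list[list[str]]:
    """Return the docker invocations for: convert, validate, tokenize, validate."""
    stem = txt_file.rsplit(".", 1)[0]
    # Output name pattern as seen above: bug/delepr_orig.folia.xml
    folia = f"{out_dir}/{stem}.folia.xml"
    # "-t -i" is copied from the thread; it may need to be dropped when the
    # command runs without a terminal (an assumption, not tested here).
    docker = ["docker", "run", "--rm", "-t", "-i", "-v", ".:/data"]
    return [
        docker + ["proycon/foliautils", "FoLiA-txt", "-O", out_dir, txt_file],
        docker + ["proycon/foliautils", "folialint", "--nooutput", folia],
        docker + ["proycon/ucto", f"-L{lang}", "--inputclass=FoLiA-txt", folia, tokenized],
        docker + ["proycon/foliautils", "folialint", "--nooutput", tokenized],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("delepr_orig.txt"):
        print(shlex.join(cmd))
        # To actually run a step: subprocess.run(cmd, check=True)
```

The key point is the order: the plain text goes through FoLiA-txt first, and ucto only ever sees the resulting FoLiA file.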

@proycon
Member

proycon commented Sep 14, 2023

> The environment I use is on ubuntu 22, specifically for folia and its ecosystem, incl. containers. I do not typically compile software (in it), so it may miss tools and compilers that developers use.

Yeah, if you use the containers you don't need any further build tools on the system itself.

@proycon
Member

proycon commented Sep 14, 2023

> The text file seems a bit mangled, or it contains strange character encodings

It's valid unicode, but there seem to be some codepoints in the private use area, so they probably render only very specifically with certain dedicated fonts.

@pirolen
Author

pirolen commented Sep 14, 2023

> It's valid unicode, but there seem to be some codepoints in the private use area, so they probably render only very specifically with certain dedicated fonts.

Thanks, that is indeed the case...
I suspect it is not possible to render them in FoLiA?

@pirolen
Author

pirolen commented Sep 14, 2023

> Yeah, if you use the containers you don't need any further build tools on the system itself.

Sure, and I do use them, but the hyphenation is not tackled correctly by them.

> As far as I know, all the fixes for this issue were already released a while back, so I tried this with the officially published containers (latest stable releases, not the latest git development version). That way you don't have to build any containers or compile anything yourself and can just pull them from Docker Hub. It all works fine:

The problem is, hyphenation stays unresolved in the xml produced by the container, see screenshot (and ignore the not-rendered unicode).

Furthermore, sentences are not recognized from this file by FLAT.

If I produce a file based on my workaround above, both hyphenation and sentences are rendering correctly in FLAT.

Screenshot 2023-09-14 at 11 52 55

@kosloot
Contributor

kosloot commented Sep 14, 2023

Well, maybe @proycon can check that in his environment? Because my XML has:

    <p xml:id="delepr_orig.p.1">
      <t class="FoLiA-txt">‌189r<br/>ст҃го́ меѳодиꙗ . е҆ппа́ фили<t-hbr>-</t-hbr>пьс
каго . къ и҆стелїю ѡ҆ прокаженїи ჻<br/>Ѿкѫдѹ , ѡ҇҆ є҆вⸯвѹлїе́ . не ꙗ҆вѣ ли ꙗ҆ко<br/>ѿ пѹ

so with <t-hbr>-</t-hbr> which is NOT used when using ucto or FoLiA-2text:

$ FoLiA-2text bla.xml
Processed :bla.xml into bla.xml.txt still 0 files to go.
$ more bla.xml.txt
189r ст҃го́ меѳодиꙗ . е҆ппа́ филипьскаго . къ и҆стелїю ѡ҆ прокаженїи ჻ Ѿкѫдѹ , ѡ҇҆ є҆вⸯв
.

@proycon
Member

proycon commented Sep 14, 2023

> Well, maybe @proycon can check that in his environment?

I have the same yes, <t-hbr>-</t-hbr> at that point in the sentence.

And at the sentence level they are gone (which is by design as I understand it):

<w xml:id="delepr_orig.p.1.s.2.w.2" class="WORD">
    <t>филипьскаго</t>
</w>

> Furthermore, sentences are not recognized from this file by FLAT.

I think that is due to there being two text classes: the normal one (current), without the hyphens, and FoLiA-txt, the original one with the hyphens. If you still see the hyphens, then it's rendering the latter, but you want the former.
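
To make the two text classes concrete, here is a small sketch that looks up text by class on a simplified fragment. It is an illustration only: `text_by_class` is a hypothetical helper, the fragment is hand-made (real ucto output carries the FoLiA namespace and spreads the classes over different elements), and it relies on a <t> without a class attribute belonging to the default class "current".

```python
import xml.etree.ElementTree as ET


def text_by_class(elem: ET.Element) -> dict[str, str]:
    """Map text class -> text for the direct <t> children of `elem`."""
    result: dict[str, str] = {}
    for t in elem.findall("t"):
        # A <t> without a class attribute is in the default class "current".
        # itertext() also includes text embedded in child elements.
        result[t.get("class", "current")] = "".join(t.itertext())
    return result


if __name__ == "__main__":
    fragment = ET.fromstring(
        "<p>"
        "<t>филипьскаго</t>"
        '<t class="FoLiA-txt">фили<t-hbr>-</t-hbr>пьскаго</t>'
        "</p>"
    )
    for cls, text in text_by_class(fragment).items():
        print(cls, "->", text)
```

A viewer that shows the "FoLiA-txt" class will therefore still display the hyphen, while the "current" class holds the joined word.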

@proycon
Member

proycon commented Sep 14, 2023

(I'm having trouble loading the file at all in FLAT currently, so there may be a bug in FLAT.)

@proycon
Member

proycon commented Sep 14, 2023

Whilst "full perspective" failed for me, switching to "sentence" perspective did work:

flat_delepr_orig

In any case, let's open any flat issues in the flat issue tracker if they arise.

@pirolen
Author

pirolen commented Sep 14, 2023

> I think that is due to there being two text classes: the normal one (current), without the hyphens, and FoLiA-txt, the original one with the hyphens. If you still see the hyphens, then it's rendering the latter, but you want the former.

Thanks!
Also, I overlooked one thing about the docker CLI: one needs to provide input and output filenames. Thanks for the CLI examples.

@kosloot
Contributor

kosloot commented Sep 15, 2023

I assume (hope) that this is settled now.
I will investigate the possibility to include the same hyphenation policy as an option in ucto.
Not sure how complex that would be.
