Greek transliteration is non deterministic #47

Open
hoschwenk opened this issue Dec 21, 2018 · 2 comments · May be fixed by #61

@hoschwenk

Transliteration of Greek is non-deterministic!
Running translit('Δεν του μίλησα ξανά.', 'el', reversed=True) several times
gives "den toy milisa xana."
or "den tou milisa xana."
Maybe both are correct, but the tool should always output the same one!
Otherwise, results are not reproducible, e.g. when used in a machine translation system.

This happens when you start python3 several times, not when translit is called in a loop within one process.

@barseghyanartur
Owner

@hoschwenk

Thanks for bringing this up.

There have been numerous attempts and PRs to bring correctness to Greek transliteration.

I'm all open for correctness and thus willing to accept a valid PR.

I think that, back in the day, I used this Wikipedia article as a valid and trustworthy source of information on the topic.

Could you please double-check your findings against the mentioned Wikipedia article and let me know if the current transliteration isn't correct?

Thank you!

@akosiaris
Contributor

I am unable to reproduce this on master (9333f24) with Python 3.9.2:

for i in `seq 1 10000` ; do python3 foo.py ; done | sort | uniq -c | sort -rn
   10000 Den toy milisa xana.

with foo.py containing

import transliterate

print(transliterate.translit('Δεν του μίλησα ξανά.', 'el', reversed=True))
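For anyone who prefers to run the same kind of check from Python, here is a minimal sketch. It starts a fresh interpreter per run and pins a different `PYTHONHASHSEED` each time, since string hash randomization is one plausible source of run-to-run differences in unordered containers. `count_outputs` and `SNIPPET` are hypothetical names, and the snippet is only a stand-in that prints the iteration order of a set; swap in the `translit` call from foo.py to test the library itself.

```python
import os
import subprocess
import sys
from collections import Counter

# Stand-in snippet (an assumption for illustration): prints the iteration
# order of a set holding four Latin keys. Sets of strings still depend on
# the string hash seed in current Python versions.
SNIPPET = 'print(" ".join({"Ou", "ou", "Oy", "oy"}))'


def count_outputs(code, seeds):
    """Run `code` in a fresh interpreter per seed and tally the outputs."""
    counts = Counter()
    for seed in seeds:
        # Keep the existing environment, only override the hash seed.
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env, capture_output=True, text=True, check=True,
        )
        counts[result.stdout.strip()] += 1
    return counts


if __name__ == "__main__":
    # One line per distinct output, most frequent first.
    for output, n in count_outputs(SNIPPET, range(1, 21)).most_common():
        print(n, output)
```

A deterministic program produces a single line here; more than one line means the output depends on the hash seed.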

This isn't easy to reproduce right now, which isn't surprising: more than 3 years have passed since the report in December 2018.

Judging from the report, I would say that we are no longer able to reproduce this because standard dictionary objects now preserve insertion order, starting with CPython 3.6 as an implementation detail and guaranteed by the language spec since Python 3.7. Given the following stanza in the pre_processor_mapping of the Greek language pack

    u"Ou": u"Ου",
    u"ou": u"ου",
    u"Oy": u"Ου",
    u"oy": u"ου",

it makes sense that the dictionaries were initialized with different orders on subsequent executions in Python versions before 3.6.

I'd say that this explains the inconsistent behavior. It also means that by now it has become extremely rare and will only show up when using older, unsupported Python versions.
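To illustrate the reasoning, here is a simplified sketch of what happens when a table with duplicate values is inverted. It is a hypothetical model, not the package's actual code: `PAIRS` and `invert` are names made up for this example, and the assumption is only that a later entry overwrites an earlier one while the reversed table is built.

```python
from itertools import permutations

# Hypothetical, simplified model of a reversed lookup table; this is NOT
# the transliterate package's actual code.
# The four entries quoted above, as (latin, greek) pairs.
PAIRS = [("Ou", "Ου"), ("ou", "ου"), ("Oy", "Ου"), ("oy", "ου")]


def invert(pairs):
    """Build a Greek -> Latin table; a later pair overwrites an earlier one."""
    table = {}
    for latin, greek in pairs:
        table[greek] = latin
    return table


# Collect every Latin spelling that some iteration order would select for "ου".
outcomes = {invert(order)["ου"] for order in permutations(PAIRS)}
print(sorted(outcomes))  # ['ou', 'oy']

# With insertion order preserved, the result is always the last duplicate.
print(invert(PAIRS)["ου"])  # oy
```

Under this model, an unordered dictionary can yield either spelling, while an insertion-ordered one always yields the last duplicate, which is consistent with the stable "toy" output above.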

However, the transliteration in the example above is just wrong.

I am not sure where the 2nd mapping comes from, but it should not be there. Both ISO 843 [1] (the international ratification of ELOT 743 v1 with a couple of minor differences) and ELOT 743 version 2 type 1 [2] (the Greek cross ratification of ISO 843 to adopt those minor differences) specifically set an exception for the double vowel ου, which needs to be transliterated as ou and vice versa. There is no mapping exception to/from oy, so while oy would be transliterated per the general rules to ου, the inverse would never be true in a transliteration context (transcription, which favors pronunciation, is a different story). It's important to note that neither the UN nor the ALA-LC (Library of Congress) treats ου differently than ISO 843/ELOT 743 v2 (which isn't the case for some other mappings).
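As a rough sketch of the rule described above: the digraph exception is checked before the single-letter rules, so ου becomes ou even though υ on its own maps to y. The tables below are a tiny illustrative subset and `to_latin` is a hypothetical helper, not the package's API or the full ISO 843 / ELOT 743 table.

```python
# Tiny illustrative subset; not the full ISO 843 / ELOT 743 table.
SINGLE = {"τ": "t", "ο": "o", "υ": "y"}
DIGRAPHS = {"ου": "ou"}  # exception takes priority over the single letters


def to_latin(text):
    """Transliterate Greek to Latin, trying two-letter exceptions first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters outside the subset are passed through unchanged.
            out.append(SINGLE.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_latin("του"))  # tou
print(to_latin("τυ"))   # ty
```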

@barseghyanartur I'll submit a PR to remove the oy mapping to conform with the two standards (and also the UN and ALA-LC). Let me know if you disagree. Incidentally, that would also resolve this specific issue in older Python versions.

[1] https://en.wikipedia.org/wiki/ISO_843
[2] https://sete.gr/files/Media/Egkyklioi/040707Latin-Greek.pdf

akosiaris added a commit to akosiaris/transliterate that referenced this issue May 5, 2022
"ου" in both ISO 843[1], the international ratification of ELOT 743 v1
with a couple of minor differences, and ELOT 743 version 2 type 1 [2]
(the Greek cross ratification of ISO 843 to adopt the above minor
differences) specifically set an exception for the double vowel "ου",
which needs to be transliterated as "ou" and vice versa. There is no
mapping exception to/from "oy", so while "oy" would be transliterated,
per the general rules, to "ου" the inverse would never be true in a
transliteration context.

It's important to note that nor the UN nor the ALA-LC
(library of congress) treat "ου" differently than ISO-843/ELOT 743 v2
(which isn't the case for some other mappings).

This closes barseghyanartur#47

Signed-off-by: Alexandros Kosiaris <[email protected]>
@akosiaris akosiaris linked a pull request May 5, 2022 that will close this issue