Wrapping Bug for Spanish GNA #42
Comments
Tweak runtime MESSAGEs and some verb messages to use full words for alternative GNA forms, instead of `$$` to join suffixes, because ALAN's line-wrapping algorithm is buggy: when it counts word lengths to determine when to wrap, it doesn't consider the `$$` that might extend the word by joining it with the next string, thus truncating words prematurely (see #42). Although the original code was slimmer and more idiomatic, the results were ugly, for it would wrap words like "puestos" at "puest", etc.
I will have a look at this. The 75 is mysterious, so at least I'm curious why that is. This only affects the command line interpreter, of course, since all GLK-terps have their own wrapping logic. ("Only" here only means "just" since we all use the command line interpreter for automated testing, so it is important.) Or, do you see this in WinArun or Gargoyle too?
No, only ARun for automated testing. I might use the GUI terps for visual inspection of text styles and multimedia, since these can't be verified via the command line, but these are not part of the test suite, strictly speaking.
From the way text wraps before [...] The 75 mystery keeps cropping up for some reason. I remember that there has already been a fix in the past for this, but it seems that the transition to UTF-8 has brought the problem back. So I wonder if encoding might be the culprit here, i.e. the counter not accounting for multi-byte characters in the source that then become single bytes in ISO.
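To make the encoding hypothesis above concrete, here is a small Python sketch (not ALAN code) comparing the displayed width of a Spanish sentence with its byte length in UTF-8 and in ISO-8859-1. The sample sentence and the `widths` helper are made up for illustration, and nothing here confirms how ALAN actually counts columns.

```python
def widths(text):
    """Return (characters, UTF-8 bytes, ISO-8859-1 bytes) for `text`."""
    chars = len(text)                            # columns actually displayed
    utf8_bytes = len(text.encode("utf-8"))       # size in a UTF-8 source file
    iso_bytes = len(text.encode("iso-8859-1"))   # size after conversion to ISO
    return chars, utf8_bytes, iso_bytes


# Hypothetical sample line containing several accented characters.
sample = "¿Qué dirección está más allá?"
chars, utf8_bytes, iso_bytes = widths(sample)
print(chars, utf8_bytes, iso_bytes)
```

If a wrapping counter were fed UTF-8 byte counts while the terminal shows one column per character, each accented character would cost one extra column, so such lines would wrap earlier than expected. Whether this accounts for the column 75 behaviour is an open question.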
@thoni56, I've noticed an issue with how ALAN wraps transcripts.
For example, in `ponibles_test.a3t` the 's' of "puestos" is split onto the next line:
This is the library code that prints "puestos":
The above is a typical example of how the Spanish library handles gender and number in various language constructs (adjectives, articles, verbs, etc.), by adding the 'a' or 'o' suffix depending on gender, and a final 's' if plural.
The problem seems to be that when the sentence reaches "puesto" (column 75) ALAN decides it's time to wrap without checking whether the upcoming text contains a `$$` (or punctuation) which might need to be joined with the current word (i.e. the one being parsed when ALAN decides to wrap).
If I were to replace the above code with:
the output wouldn't be truncated prematurely. Apparently, ALAN sees the `$$` and waits before wrapping. The problem is that the above code variation is more verbose than the one being used, because we only add the final 's' if the noun is plural (so no `$$` on the previous vowel, in case there's no need to add a plural 's').
Probably I should add a proper minimum viable ad hoc test in the alan-bugs-testbed, but I wanted to mention it right away when I discovered it, and begin by posting here on ALAN i18n, since this affects the Spanish library and we all need to be aware of the issue and decide if it's worth using the longer code to prevent breaking the word.
Also, I'm not sure why ALAN is wrapping at 75, since I believe the default is 80 columns. I think this issue of incorrect wrapping already came up before, and was due to miscounting the various special `$` symbols in a way that affected column bookkeeping for when to wrap. But I thought that the problem had been solved already.
In any case, this problem also affects punctuation, for I noticed in various transcripts that ALAN wraps lines just before a `.`, `,`, or `)` (or other punctuation marks), which doesn't look nice either. I'm not sure if this is due to the presence of a `$$` in the previous token or preceding the punctuation mark, but ALAN should definitely do some lookahead before wrapping, to check that the next string "token" is not something that needs to be joined to the current one.
From what I remember from peeking at the ALAN sources, the way output strings work in ALAN is a bit intricate, since some strings are retrieved from disk (those that are within quotes in the source) while others are taken from memory (those stored as attributes), and the way these are handled is a bit complex due to Huffman compression, so the whole process is a very fragmented series of long jumps in C, where the various snippets that will form a string are retrieved as the AMachine munches code in real time.
I'm not sure where the part that handles wrapping falls in the process, but it looks like strings are truncated as they are being "stitched together", i.e. there's no "paragraph buffer" where they are stored for later inspection and wrapping. I guess that adding some lookahead functionality to prevent cases like the above would probably require lots of code changes.
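To illustrate the difference between wrapping fragments as they arrive and wrapping after a lookahead, here is a minimal Python sketch. It is not ALAN's actual algorithm: the fragment list, the function names, and the assumption that a fragment starting with `$$` is glued to the previous one without a space are all hypothetical simplifications, and each fragment is assumed to hold a single word.

```python
JOIN = "$$"  # assumed marker: glue this fragment to the previous one


def naive_wrap(fragments, width):
    """Wrap fragment by fragment, with no lookahead (mimics the reported symptom)."""
    lines, current = [], ""
    for frag in fragments:
        glued = frag.startswith(JOIN)
        text = frag[len(JOIN):] if glued else frag
        sep = "" if glued or not current else " "
        if current and len(current) + len(sep) + len(text) > width:
            lines.append(current)  # breaks here, even in the middle of a word
            current = text
        else:
            current += sep + text
    if current:
        lines.append(current)
    return lines


def merge_joined(fragments):
    """Lookahead step: fold every `$$` fragment into the word before it."""
    words = []
    for frag in fragments:
        if frag.startswith(JOIN):
            suffix = frag[len(JOIN):]
            if words:
                words[-1] += suffix
            else:
                words.append(suffix)
        else:
            words.append(frag)
    return words


def lookahead_wrap(fragments, width):
    """Wrap only between complete words, after `$$` joins are resolved."""
    lines, current = [], ""
    for word in merge_joined(fragments):
        if current and len(current) + 1 + len(word) > width:
            lines.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        lines.append(current)
    return lines


# Hypothetical fragments: a stem, a gender suffix and a plural suffix.
fragments = ["Llevas", "puest", "$$o", "$$s", "los", "guantes."]
print(naive_wrap(fragments, 12))
print(lookahead_wrap(fragments, 12))
```

With a width of 12, the naive version yields `['Llevas puest', 'os los', 'guantes.']`, splitting "puestos" after "puest", while the lookahead version yields `['Llevas', 'puestos los', 'guantes.']`. The lookahead version needs a buffer of pending words, which is exactly the kind of "paragraph buffer" that streamed output seems to lack.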