add some localizations #8

Open
wants to merge 3 commits into
base: master

Konfekt commented May 19, 2024

Rudimentary support for en, de, pt, es, fr, it, ... for the time being, hopefully without any clashes.
Thank you for your useful tool.

Owner

ap commented May 19, 2024

Hi, thank you for the patch. I have some questions.

On its own terms, I don’t understand the (?<!q&)und, (?<!q&)y, and (?<!q&)à bits. I don’t think I’ve ever seen “Q&Und” in German text, much less it also needing to be treated differently from just “und”. I suspect the same is true for “Q&À” vs “à” in French and “Q&Y” vs “y” in Spanish?

I also don’t know how I feel about this patch in general. You’ll note (although the code is structured so that people who merely want to use the program can ignore this) that there is a CPAN directory, whereby the same code lives on CPAN as a module called Lingua::EN::Titlecase::Simple. Note the language in the moniker: this code is explicitly English-centric (and reflects a very specific and narrow English-speaking sensibility at that: the NYT style manual). I dealt with this code too long ago to remember the details, but I believe some of the rules implemented by the rest of the code do not apply well to German, in that the output is not what a German speaker would expect. I don’t have the same feel for the other languages, but on the basis of the one I do know, I would be surprised if the same were not true of the others. So I am trying to understand the circumstances that make this patch useful. Is the idea to process mostly English text for an English audience that just happens to contain stray foreign words, other-language quotations, or similar embeds? Even that seems a strange guess, but I can’t think of anything better.

Author

Konfekt commented May 19, 2024

It's just that it seemed not to work with texts in other languages, like German or Spanish;
it's rudimentary, but it could still be useful, for example, as a filter in Vim.
Pardon my possible ignorance; I just thought there aren't any title case shell scripts for other languages, so here's a first tentative step.
Regarding the (?<!q&), these are nonsense. I removed them.

Owner

ap commented May 19, 2024

> It's just that it seemed not to work with texts in other languages, like German or Spanish

The reason there is no such script at least for German is that in German, capitalization is part of spelling and grammar, so arbitrarily titlecasing German text this way simply makes it incorrect. Headlines are either cased the same as regular text or set in all caps; titlecasing is not a thing.

Sorry, I should have been fully awake before my first response and noticed this fundamental issue before getting bogged down with details like the (?<!q&) bit.

I don’t know whether the same is true for Spanish or any of the other languages in your patch – I suspect it is, because I know it is also true for Greek, and therefore not a peculiarity of German in particular… but I don’t know it for a fact. Do you?

> I just thought there aren't any title case shell scripts for other languages, so here's a first tentative step.

Is that observation the only basis though? Do you not speak any of those languages? (Or write or edit for audiences reading them, or whatever direct knowledge.) I do appreciate the sentiment of wanting to help, and I appreciate that you took the time to prepare and submit a patch. Apologies that the subject of this code has somewhat of a barrier to entry.

> Regarding the (?<!q&), these are nonsense. I removed them.

Wait, they weren’t all nonsense. I only asked about them in German, French, and Spanish specifically, and there was a reason for that: in the lists for the other languages, the (?<!q&) bit preceded a plain a. And just because you called the variable (e.g.) @SMALL_WORD_IT doesn’t mean it somehow gets applied to Italian only – all the small-word lists from all languages get mixed together at the end and therefore get exempted in every text, no matter its language. So if there is a plain a without the (?<!q&) bit in any of the word lists, then Q&A in English text will titlecase as Q&a. This also means “speakers with aux input” will titlecase as “Speakers With aux Input” just because the French list includes “aux”. So a single script that somehow works for any input text regardless of its language(s) will be much harder to write than multiple scripts that each only work for one particular language.
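
The list-mixing effect described above can be reproduced with a small sketch. The list names, their contents, and the `naive_titlecase` helper are hypothetical stand-ins, not the patch's or the script's actual code; the helper only capitalizes every word and then lowercases small words, which is enough to isolate the problem.

```perl
use strict;
use warnings;

# Hypothetical per-language lists, loosely modelled on the patch under
# discussion; names and contents are illustrative only.
my @small_en = ('(?<!q&)a', 'an', 'and', 'of', 'the');
my @small_fr = ('aux', 'de', 'la');
my @small_it = ('a', 'di', 'il');    # plain "a", no (?<!q&) guard

# Build one case-insensitive alternation from whichever lists are passed in.
sub small_re {
    my $alternation = join '|', @_;
    return qr/\b(?:$alternation)\b/i;
}

# Simplified titlecasing: capitalize every word, then lowercase small words.
# The real script does much more; this only isolates the list-mixing effect.
sub naive_titlecase {
    my ($text, $re) = @_;
    $text =~ s/(\w+)/\u\L$1/g;
    $text =~ s/($re)/\L$1/g;
    return $text;
}

my $english_only = small_re(@small_en);
my $all_mixed    = small_re(@small_en, @small_fr, @small_it);

print naive_titlecase('q&a session', $english_only), "\n";           # Q&A Session
print naive_titlecase('q&a session', $all_mixed), "\n";              # Q&a Session
print naive_titlecase('speakers with aux input', $all_mixed), "\n";  # Speakers With aux Input
```

The guarded `(?<!q&)a` entry protects "Q&A" only as long as no other list contributes an unguarded `a` to the same alternation.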

Author

Konfekt commented May 19, 2024

Yes, this is understood. The less a script does, the easier it is to be good at it.
Ideally, the language would first have to be (m)guessed.
Even then, it will only work most of the time.
For example, the titlecased words

- to-Do
- by-Pass
- en-Route
- in-Law
- in-House
- of-Age

might not be to everyone's taste, but they're rare enough that it isn't worth going to great lengths over them, and titlecase is still useful.

Owner

ap commented Oct 16, 2024

> For example, the titlecased words
>
> - to-Do
> - by-Pass
> - en-Route
> - in-Law
> - in-House
> - of-Age
>
> might not be to everyone's taste, but they're rare enough that it isn't worth going to great lengths over them, and titlecase is still useful.

The version I just released matches John Gruber’s latest public upstream version (from 2015 as of this writing), which contains code to handle exactly these cases.

As for making this code multi-lingual, I’m afraid I’m not interested in shipping that.

However, the module version of the code already allows modifying the small-word list by changing the @SMALL_WORD array, so I don’t need to do that anyway. I wouldn’t mind adding a command line option to the program version to read a replacement small-word list from a file. Would that get you sorted?

Author

Konfekt commented Oct 16, 2024

> The version I just released matches John Gruber’s latest public upstream version (from 2015 as of this writing), which contains code to handle exactly these cases.

Very laudable, thank you. Is that the code?

s{
		\b
		(?<! -)					# Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (in-flight)
		( $small_re )
		(?= -[[:alpha:]]+)		# lookahead for "-someword"
	}{\u\L$1}xig;
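
To see what that substitution does in isolation, here is a minimal sketch that wraps it in a function. The `$small_re` value is an assumed, shortened small-word alternation (the real script builds its own), and `capitalize_leading_small_word` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed, shortened small-word alternation; the real script builds
# $small_re from its own, longer list.
my $small_re = join '|', qw(by en in of the to);

# Capitalize a small word that starts a hyphenated compound.
sub capitalize_leading_small_word {
    my ($text) = @_;
    $text =~ s{
        \b
        (?<! -)               # not preceded by a hyphen, so "-in-" is skipped
        ( $small_re )         # the small word itself
        (?= -[[:alpha:]]+)    # must be followed by "-someword"
    }{\u\L$1}xig;
    return $text;
}

print capitalize_leading_small_word('to-Do'), "\n";              # To-Do
print capitalize_leading_small_word('in-Law'), "\n";             # In-Law
print capitalize_leading_small_word('man-in-the-Middle'), "\n";  # man-in-the-Middle
```

The lookbehind is what keeps "in" and "the" lowercase inside "man-in-the-Middle": both are preceded by a hyphen, so only a small word at the start of a compound is capitalized.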

> I don’t need to do that anyway.

This is open-source and voluntary effort. I am just thinking from a user perspective. How many will have the inclination, knowledge, and time to scrutinize the Perl code base and replace the array by their localized version?
I just reckon most prefer this to be done automatically by (m)guessing the language and using a suitable array;
only then they'd consider changes to the program and its parameters.

Owner

ap commented Oct 16, 2024

> Is that the code?

Yes.

> This is open-source and voluntary effort. I am just thinking from a user perspective. How many will have the inclination, knowledge, and time to scrutinize the Perl code base and replace the array by their localized version?

That’s why I’m asking whether it would satisfy your use case if the program had a switch, something like --small-words-from FILENAME, which would allow you to replace the small words list from the command line.
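
A rough sketch of how such a switch might be parsed, purely as an illustration: `--small-words-from` is only proposed in this thread and does not exist in the program, and the file format (one word per line, `#` comments), the default list, and both function names are assumptions.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Stand-in for the script's built-in small words (assumed, abbreviated).
my @DEFAULT_SMALL_WORDS = qw(a an and as at but by en for if in of on or the to via);

# Read one small word per line; skip blank lines and "#" comments.
# The file format is an assumption, not something the thread specifies.
sub read_small_words {
    my ($filename) = @_;
    open my $fh, '<:encoding(UTF-8)', $filename
        or die "Cannot open $filename: $!\n";
    my @words;
    while (my $line = <$fh>) {
        $line =~ s/#.*//;           # strip trailing comments
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace and newline
        push @words, $line if length $line;
    }
    close $fh;
    return @words;
}

# Parse the hypothetical switch from an argument list and return the
# small-word list to use: the file's contents if given, else the default.
sub small_words_for {
    my @args = @_;
    my $file;
    GetOptionsFromArray(\@args, 'small-words-from=s' => \$file)
        or die "Usage: $0 [--small-words-from FILENAME]\n";
    return defined $file ? read_small_words($file) : @DEFAULT_SMALL_WORDS;
}
```

Called as `small_words_for(@ARGV)`, this replaces the whole list rather than extending it, which matches the "replacement small-word list" wording above.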

> I just reckon most prefer this to be done automatically by (m)guessing the language and using a suitable array; only then they'd consider changes to the program and its parameters.

I’ve made my decision. If Gruber were to add such a change upstream I would take it with no questions asked, otherwise it’s not going in. The many, many people who would prefer otherwise will have to either consider whether the proposed --small-words-from switch would work for them or use something else.

Author

Konfekt commented Oct 16, 2024

Makes perfect sense. Yes, --small-words-from FILENAME is a good option; since the list is so short, wouldn't an option that takes a string of words, --small-words, be a better fit?
