Generalized Text::Hyphen #5
Playing around with Text::Hyphen, I see a couple more needs:
Another improvement: If a fragment (or entire word) exceeds some maximum size (say, 8 characters), force it to be split anyway. This avoids ridiculous cases where a very, very long string might not even fit on a line, or leaves an enormous "hole" in a line when it moves to the next line.
Even if that long Aa...ah fits on one line, if it nearly fit at the end of the previous line, right-justifying that previous line will take a ridiculous amount of stretch (Knuth-Plass). Even if this text is a bit contrived, you can easily end up with long unsplittable runs from things like passwords and MD5 hashes in your text. Even foreign words may be a problem if your language selection can't recognize them and there's no means to use a different Knuth-Liang pattern list on demand.
First try splitting at reasonable points, in this order: after hyphens/dashes, between a lowercase letter and an uppercase letter (camelCase text), between a letter and a digit (or vice-versa), within runs of digits, and finally after (or before?) punctuation. Obey the minimum prefix and suffix lengths (such as 2/3), and keep chopping until nothing is longer than 8 characters. The exact length could be a new parameter, so that you can tune or suppress this behavior.
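To make the proposal concrete, here is a rough sketch of the forced-split idea in Python rather than Perl. It is not Text::Hyphen code: the name `force_split`, its parameters, and the choice to cut at the candidate nearest the middle are all assumptions of mine. It also applies the prefix/suffix minimums to every fragment it cuts, which is a simplification.

```python
import re

# Candidate break rules, tried in the priority order suggested above.
# Each pattern matches the zero-width position where a break may go.
BREAK_RULES = [
    re.compile(r"(?<=[-\u2010\u2013\u2014])"),               # after a hyphen or dash
    re.compile(r"(?<=[a-z])(?=[A-Z])"),                      # camelCase boundary
    re.compile(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])"),  # letter/digit boundary
    re.compile(r"(?<=\d)(?=\d)"),                            # inside a run of digits
    re.compile(r"(?<=[^\w\s])"),                             # after punctuation
]


def force_split(word, max_len=8, min_prefix=2, min_suffix=3):
    """Split `word` into fragments no longer than `max_len` where possible.

    Hypothetical helper: assumes min_prefix >= 1 and min_suffix >= 1.
    """
    n = len(word)
    # Short enough already, or too short to honor both minimums: leave alone.
    if n <= max_len or n < min_prefix + min_suffix:
        return [word]
    lo, hi = min_prefix, n - min_suffix  # allowed cut positions, inclusive
    cut = None
    for rule in BREAK_RULES:
        spots = [m.start() for m in rule.finditer(word) if lo <= m.start() <= hi]
        if spots:
            # Assumption: prefer the candidate closest to the middle.
            cut = min(spots, key=lambda p: abs(p - n / 2))
            break
    if cut is None:
        # No "reasonable" point found, so chop blindly at max_len.
        cut = max(lo, min(max_len, hi))
    return (force_split(word[:cut], max_len, min_prefix, min_suffix)
            + force_split(word[cut:], max_len, min_prefix, min_suffix))
```

For example, `force_split("well-known-value")` breaks after each hyphen, while a 32-character MD5 hash gets cut at letter/digit boundaries until every piece is 8 characters or fewer. Passing a very large `max_len` effectively turns the behavior off.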
Please take a look at Pull Request #1, in case you've overlooked it. The idea is to release just the one Text::Hyphen, rather than releasing and maintaining a whole bunch of Text::Hyphen::XX packages. That one package could either be updated manually with the desired language files from the CTAN library, or go and fetch them itself, given a language option in new(). At this point I'm not sure if there are any issues with where the cache or library of patterns and exceptions should go, with respect to permissions across a wide range of platforms. I.e., can a random user trigger an action that adds files to the library in their Perl module collection?
Add: Perhaps Hyphen.pm should have a clearly marked and easily changed "where the cache is" setting, and/or an option setting for new(), where it's to write (and read from) all the pattern and exception files. This would get around worrying about some users not having permission to write to certain directories.
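A sketch of how such a setting might be resolved, again in Python for illustration only. Everything named here is hypothetical: the `TEXT_HYPHEN_CACHE` environment variable, the per-user default directory, and the `hyph-<lang>.pat.txt` file naming are assumptions, not anything Text::Hyphen or PR #1 defines.

```python
import os
from pathlib import Path

ENV_VAR = "TEXT_HYPHEN_CACHE"  # hypothetical environment variable name


def resolve_cache_dir(explicit=None, environ=None):
    """Pick the pattern cache directory: explicit option, then env var, then a per-user default."""
    environ = os.environ if environ is None else environ
    if explicit:
        chosen = Path(explicit)               # e.g. an option passed to new()
    elif environ.get(ENV_VAR):
        chosen = Path(environ[ENV_VAR])       # site-wide or per-user override
    else:
        # Assumed default: a directory the current user can normally write to.
        chosen = Path.home() / ".cache" / "text-hyphen"
    chosen.mkdir(parents=True, exist_ok=True)
    if not os.access(chosen, os.W_OK):
        raise PermissionError(f"cannot write pattern files to {chosen}")
    return chosen


def pattern_file(lang, cache_dir):
    """Where the pattern file for `lang` would live (file name is an assumption)."""
    return Path(cache_dir) / f"hyph-{lang}.pat.txt"
```

Defaulting to a per-user directory sidesteps the question of whether an ordinary user may write into the installed module tree, and the explicit option still lets an administrator point everyone at a shared, pre-populated location.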
Then there's the issue of how you keep this library updated, should CTAN refresh a file. Certainly one way to do it is just as done now, which is to have separate Text::Hyphen::XX packages, and use the normal CPAN update mechanism. However, this means that someone will need to take on building and releasing all these packages in the first place, and keeping them all up to date! I suspect that this was the impetus for PR 1 in the first place: that users wouldn't have to wait for someone to get around to creating a package. By the way, one very useful package would be Latin! How many packages use Lorem Ipsum text for examples, and would like a way to properly hyphenate it?
Add: Perhaps there's a way that I've overlooked, but there doesn't seem to be a way to subscribe to CTAN to tell your local system to update its cache or library of hyphenation data. Maybe the best way would be to periodically run a utility to check last-modified dates on the page, and pull down anything that needs an update? On Linuxy systems, at least, this could be a cron job (dunno about Windows). Worst case, whenever Text::Hyphen is run, it could check the date/time and run the utility for you? Anyway, there are probably not many changes to such files once they've settled down.
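The staleness check itself could be as small as the sketch below (Python, illustrative only). It assumes the server reports a standard HTTP `Last-Modified` header for each pattern file, which I haven't verified for CTAN, and it leaves out the actual fetch. `needs_refresh` is a made-up name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path


def needs_refresh(local_file, last_modified_header):
    """Return True if the cached file is missing or older than the remote copy.

    `last_modified_header` is the raw HTTP date string, for example
    "Wed, 21 Oct 2015 07:28:00 GMT". How it is obtained (a HEAD request,
    a directory listing, ...) is outside this sketch.
    """
    path = Path(local_file)
    if not path.exists():
        return True  # nothing cached yet, so fetch it
    remote = parsedate_to_datetime(last_modified_header)
    if remote.tzinfo is None:
        # Treat a zone-less date as UTC so the comparison below is valid.
        remote = remote.replace(tzinfo=timezone.utc)
    local = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return remote > local
```

A cron job (or a check at startup) would call this once per cached language file and download only the ones for which it returns true.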
Anyway, I think it's time to discuss better ways of getting hyphenation support out for a wide range of languages, and CTAN appears to have done much of the work already, if we could just directly read those files (and import them easily).